fix: drop obsolete release smoke check

chore: bump version to 4.3.0
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-03 16:05:26 +08:00 · 2026-05-02 04:19:23 +08:00 · 2026-05-02 03:55:36 +08:00 · 2026-05-02 03:26:43 +08:00 · 2026-05-02 02:40:57 +08:00 · 2026-05-02 02:31:24 +08:00
162 changed files with 16175 additions and 4884 deletions
--- a/.github/workflows/release-artifacts.yml
+++ b/.github/workflows/release-artifacts.yml
@@ -47,7 +47,6 @@ jobs:

      - name: Release Blocking Gates
        run: |
-          ./tests/scripts/check-stage6-manual-smoke.sh
          ./tests/scripts/check-refactor-line-gate.sh
          ./tests/scripts/run-unit-all.sh

--- a/.gitignore
+++ b/.gitignore
@@ -29,6 +29,7 @@ yarn.lock
 pnpm-lock.yaml

 # Build artifacts
+dist/
 *.tsbuildinfo
 .cache/
 .parcel-cache/
--- a/API.en.md
+++ b/API.en.md
@@ -33,6 +33,8 @@ Docs: [Overview](README.en.md) / [Architecture](docs/ARCHITECTURE.en.md) / [Depl
 | Health probes | `GET /healthz`, `GET /readyz` |
 | CORS | Enabled (uniformly covers `/v1/*`, `/anthropic/*`, `/v1beta/models/*`, and `/admin/*`; echoes the browser `Origin` when present, otherwise `*`; default allow-list includes `Content-Type`, `Authorization`, `X-API-Key`, `X-Ds2-Target-Account`, `X-Ds2-Source`, `X-Vercel-Protection-Bypass`, `X-Goog-Api-Key`, `Anthropic-Version`, `Anthropic-Beta`, and also accepts third-party preflight-requested headers such as `x-stainless-*`; `/v1/chat/completions` on Vercel Node Runtime matches the same behavior; internal-only `X-Ds2-Internal-Token` remains blocked) |

+- All JSON request bodies must be valid UTF-8; malformed byte sequences are rejected on ingress with `400 invalid json`.
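The ingress rule added above could be sketched as follows. This is a minimal illustration, not the project's actual middleware: `checkJSONBody` and `errInvalidJSON` are hypothetical names, and the sketch assumes the check runs on the fully buffered body. The explicit `utf8.Valid` call matters because Go's `encoding/json` decoder substitutes U+FFFD for invalid UTF-8 inside strings rather than failing.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"unicode/utf8"
)

// errInvalidJSON mirrors the "invalid json" message from the API notes.
// The variable name is hypothetical.
var errInvalidJSON = errors.New("invalid json")

// checkJSONBody rejects bodies that are not valid UTF-8 or not well-formed
// JSON. A handler would map a non-nil error to HTTP 400.
func checkJSONBody(body []byte) error {
	if !utf8.Valid(body) {
		return errInvalidJSON
	}
	if !json.Valid(body) {
		return errInvalidJSON
	}
	return nil
}

func main() {
	ok := []byte(`{"model":"deepseek-v4-flash"}`)
	bad := []byte{'{', '"', 0xff, '"', ':', '1', '}'} // 0xff is never valid UTF-8
	fmt.Println(checkJSONBody(ok))
	fmt.Println(checkJSONBody(bad))
}
```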
+
 ### 3.0 Adapter-Layer Notes

 - OpenAI / Claude / Gemini protocols are now mounted on one shared `chi` router tree assembled in `internal/server/router.go`.
@@ -81,7 +83,7 @@ Two header formats accepted:
 - Token is in `config.keys` → **Managed account mode**: DS2API auto-selects an account via rotation
 - Token is not in `config.keys` → **Direct token mode**: treated as a DeepSeek token directly

-**Optional header**: `X-Ds2-Target-Account: <email_or_mobile>` — Pin a specific managed account.
+**Optional header**: `X-Ds2-Target-Account: <email_or_mobile>` — Pin a specific managed account; if the target account does not exist or the managed-account queue is exhausted, the request returns `429`, and current responses do not include `Retry-After`. If the account exists but login/refresh fails, the request returns the underlying `401` or upstream error.
 Gemini-compatible clients can also send `x-goog-api-key`, `?key=`, or `?api_key=` as the caller credential source.

 ### Admin Endpoints (`/admin/*`)
@@ -165,6 +167,8 @@ Gemini-compatible clients can also send `x-goog-api-key`, `?key=`, or `?api_key=
 | PUT | `/admin/chat-history/settings` | Admin | Update conversation history retention limit |
 | GET | `/admin/version` | Admin | Check current version and latest Release |

+OpenAI `/v1/*` paths are canonical. For clients configured with the bare DS2API service URL, the same OpenAI handlers are also exposed through root shortcuts: `/models`, `/models/{id}`, `/chat/completions`, `/responses`, `/responses/{response_id}`, `/embeddings`, and `/files`.
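As a rough sketch of the shortcut idea, one handler can be registered under both the canonical `/v1` prefix and the bare root. The project mounts routes on a `chi` router; the standard-library `http.ServeMux` is swapped in here only to keep the example dependency-free, and `newMux`, `get`, and the stub response body are hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// newMux registers the same handler under the canonical /v1 path and under
// the root shortcut, so both URLs hit identical code.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	listModels := func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		io.WriteString(w, `{"object":"list","data":[]}`) // stub payload
	}
	for _, prefix := range []string{"/v1", ""} {
		mux.HandleFunc(prefix+"/models", listModels)
	}
	return mux
}

// get issues an in-memory GET and returns the status code and body.
func get(h http.Handler, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	mux := newMux()
	for _, p := range []string{"/v1/models", "/models"} {
		code, body := get(mux, p)
		fmt.Println(p, code, body)
	}
}
```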
+
 ---

 ## Health Endpoints
@@ -196,11 +200,15 @@ No auth required. Returns the currently supported DeepSeek native model list.
  "object": "list",
  "data": [
    {"id": "deepseek-v4-flash", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
+    {"id": "deepseek-v4-flash-nothinking", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
    {"id": "deepseek-v4-pro", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
+    {"id": "deepseek-v4-pro-nothinking", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
    {"id": "deepseek-v4-flash-search", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
+    {"id": "deepseek-v4-flash-search-nothinking", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
    {"id": "deepseek-v4-pro-search", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
+    {"id": "deepseek-v4-pro-search-nothinking", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
    {"id": "deepseek-v4-vision", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
-    {"id": "deepseek-v4-vision-search", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []}
+    {"id": "deepseek-v4-vision-nothinking", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []}
  ]
 }
 ```
@@ -224,6 +232,8 @@ Built-in aliases come from `internal/config/models.go`; `config.model_aliases` c
 - Gemini: `gemini-2.5-pro`, `gemini-2.5-flash`, `gemini-pro-vision`
 - Other compatibility families: `llama-*`, `qwen-*`, `mistral-*`, and `command-*` fall back through family heuristics

+Current vision support resolves only to `deepseek-v4-vision` and `deepseek-v4-vision-nothinking`; there is no separate `vision-search` variant.
+
 Retired historical families such as `claude-1.*`, `claude-2.*`, `claude-instant-*`, and `gpt-3.5*` are explicitly rejected.

 ### `POST /v1/chat/completions`
@@ -297,7 +307,7 @@ data: [DONE]
 - When thinking is enabled, the stream may emit `delta.reasoning_content`
 - Text emits `delta.content`
 - Last chunk includes `finish_reason` and `usage`
-- Token counting prefers pass-through from upstream DeepSeek SSE (`accumulated_token_usage` / `token_usage`), and only falls back to local estimation when upstream usage is absent
+- Token counting prefers pass-through from upstream DeepSeek SSE (`accumulated_token_usage` / `token_usage`), and only falls back to local estimation when upstream usage is absent. Failed/interrupted endings (for example `response.failed`) may not include `usage`

 #### Tool Calls

@@ -413,7 +423,7 @@ Business auth required. Returns OpenAI-compatible embeddings shape.
 | `model` | string | ✅ | Supports native models + alias mapping |
 | `input` | string/array | ✅ | Supports string, string array, token array |

-> Requires `embeddings.provider`. Current supported values: `mock` / `deterministic` / `builtin`. If missing/unsupported, returns standard error shape with HTTP 501.
+> Requires `embeddings.provider`. Current supported values: `mock` / `deterministic` / `builtin` (all three use the same local deterministic implementation). If missing/unsupported, returns standard error shape with HTTP 501.

 ### `POST /v1/files`

@@ -427,7 +437,7 @@ Business auth required. OpenAI Files-compatible upload endpoint; currently only
 Constraints and behavior:

 - `Content-Type` must be `multipart/form-data` (otherwise `400`).
-- Total request size limit is `100 MiB` (over-limit returns `413`).
+- Total request size limit is **100 MiB** (over-limit returns `413`).
 - Success returns an OpenAI `file` object (`id/object/bytes/filename/purpose/status`, etc.) and includes `account_id` for source-account tracing.

 ---
@@ -481,6 +491,13 @@ anthropic-version: 2023-06-01
 | `stream` | boolean | ❌ | Default `false` |
 | `system` | string | ❌ | Optional system prompt |
 | `tools` | array | ❌ | Claude tool schema |
+| `thinking` | object | ❌ | Anthropic thinking config; translated into downstream reasoning control, and ignored by `-nothinking` models |
+| `temperature` | number | ❌ | Passed through to the downstream bridge; if `temperature` and `top_p` are both present, `temperature` wins |
+| `top_p` | number | ❌ | Passed through when `temperature` is absent |
+| `stop_sequences` | array | ❌ | Passed through as downstream stop sequences |
+| `tool_choice` | string/object | ❌ | Supports `auto` / `none` / `required` / `{"type":"function","name":"..."}` and is translated to downstream tool choice |
+
+> Note: `thinking`, `temperature`, `top_p`, `stop_sequences`, and `tool_choice` are translated through the compatibility bridge. Final behavior still depends on the selected model and upstream support. When both `temperature` and `top_p` are present, `temperature` takes precedence.

 #### Non-Stream Response

@@ -917,12 +934,15 @@ Updates proxy binding for a specific account.
  "message": "API test successful (session creation only)",
  "model": "deepseek-v4-flash",
  "session_count": 0,
-  "config_writable": true
+  "config_writable": true,
+  "config_warning": ""
 }
 ```

 If a `message` is provided, `thinking` may also be included when the upstream response carries reasoning text.

+When the configured file path is not writable (for example, read-only `/app/config.json` inside some containers), login/session testing still proceeds; `config_warning` is returned to indicate token persistence failed and the token is memory-only until restart.
+
 ### `POST /admin/accounts/test-all`

 Optional request field: `model`.
@@ -1206,7 +1226,7 @@ Clients should handle HTTP status code plus `error` / `detail` fields.
 | Code | Meaning |
 | --- | --- |
 | `401` | Authentication failed (invalid key/token, or expired admin JWT) |
-| `429` | Too many requests (exceeded inflight + queue capacity) |
+| `429` | Too many requests (exceeded inflight + queue capacity; current responses do not include `Retry-After`) |
 | `503` | Model unavailable or upstream error |

 ---
--- a/API.md
+++ b/API.md
@@ -33,6 +33,8 @@
 | 健康检查 | `GET /healthz`、`GET /readyz` |
 | CORS | 已启用（统一覆盖 `/v1/*`、`/anthropic/*`、`/v1beta/models/*`、`/admin/*`；浏览器有 `Origin` 时回显该 Origin，否则为 `*`；默认允许 `Content-Type`, `Authorization`, `X-API-Key`, `X-Ds2-Target-Account`, `X-Ds2-Source`, `X-Vercel-Protection-Bypass`, `X-Goog-Api-Key`, `Anthropic-Version`, `Anthropic-Beta`，并会放行预检里声明的第三方请求头，如 `x-stainless-*`；Vercel 上 `/v1/chat/completions` 的 Node Runtime 也对齐相同行为；内部专用头 `X-Ds2-Internal-Token` 仍被拦截） |

+- 所有 JSON 请求体都必须是合法 UTF-8；非法字节序列会在入站阶段被拒绝为 `400 invalid json`。
+
 ### 3.0 接口适配层说明

 - OpenAI / Claude / Gemini 三套协议已统一挂在同一 `chi` 路由树上，由 `internal/server/router.go` 负责装配。
@@ -81,7 +83,7 @@ Vercel 一键部署可先只填 `DS2API_ADMIN_KEY`，部署后在 `/admin` 导
 - token 在 `config.keys` 中 → **托管账号模式**，自动轮询选择账号
 - token 不在 `config.keys` 中 → **直通 token 模式**，直接作为 DeepSeek token 使用

-**可选请求头**：`X-Ds2-Target-Account: <email_or_mobile>` — 指定使用某个托管账号。
+**可选请求头**：`X-Ds2-Target-Account: <email_or_mobile>` — 指定使用某个托管账号；如果目标账号不存在，或管理账号队列已耗尽，相关业务请求会返回 `429`，当前不会附带 `Retry-After` 头。若账号存在但登录/刷新失败，则返回对应的 `401` 或上游错误。
 Gemini 兼容客户端还可以使用 `x-goog-api-key`、`?key=` 或 `?api_key=` 作为凭据来源。

 ### Admin 接口（`/admin/*`）
@@ -165,6 +167,8 @@ Gemini 兼容客户端还可以使用 `x-goog-api-key`、`?key=` 或 `?api_key=`
 | PUT | `/admin/chat-history/settings` | Admin | 更新对话记录保留条数 |
 | GET | `/admin/version` | Admin | 查询当前版本与最新 Release |

+OpenAI `/v1/*` 仍是规范路径。对于只配置 DS2API 根地址的客户端，同一套 OpenAI handler 也通过根路径快捷路由暴露：`/models`、`/models/{id}`、`/chat/completions`、`/responses`、`/responses/{response_id}`、`/embeddings`、`/files`。
+
 ---

 ## 健康检查
@@ -204,9 +208,7 @@ Gemini 兼容客户端还可以使用 `x-goog-api-key`、`?key=` 或 `?api_key=`
    {"id": "deepseek-v4-pro-search", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
    {"id": "deepseek-v4-pro-search-nothinking", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
    {"id": "deepseek-v4-vision", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
-    {"id": "deepseek-v4-vision-nothinking", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
-    {"id": "deepseek-v4-vision-search", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
-    {"id": "deepseek-v4-vision-search-nothinking", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []}
+    {"id": "deepseek-v4-vision-nothinking", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []}
  ]
 }
 ```
@@ -232,6 +234,7 @@ Gemini 兼容客户端还可以使用 `x-goog-api-key`、`?key=` 或 `?api_key=`
 - 其他兼容族：`llama-*`、`qwen-*`、`mistral-*`、`command-*` 会按家族启发式回退

 上述 alias 若在请求名后追加 `-nothinking` 后缀，也会映射到对应的强制关闭 thinking 版本。
+当前视觉能力仅对应 `deepseek-v4-vision` / `deepseek-v4-vision-nothinking`，不会解析出独立的 `vision-search` 变体。

 退役历史模型（如 `claude-1.*`、`claude-2.*`、`claude-instant-*`、`gpt-3.5*`）会被显式拒绝。

@@ -306,7 +309,7 @@ data: [DONE]
 - 开启 thinking 时会输出 `delta.reasoning_content`
 - 普通文本输出 `delta.content`
 - 最后一段包含 `finish_reason` 和 `usage`
-- token 计数优先透传上游 DeepSeek SSE（如 `accumulated_token_usage` / `token_usage`）；仅在上游缺失时回退本地估算
+- token 计数优先透传上游 DeepSeek SSE（如 `accumulated_token_usage` / `token_usage`）；仅在上游缺失时回退本地估算。失败/中断型结束（例如 `response.failed`）可能不会携带 `usage`

 #### Tool Calls

@@ -423,7 +426,7 @@ data: [DONE]
 | `model` | string | ✅ | 支持原生模型 + alias 自动映射 |
 | `input` | string/array | ✅ | 支持字符串、字符串数组、token 数组 |

-> 需配置 `embeddings.provider`。当前支持：`mock` / `deterministic` / `builtin`。未配置或不支持时返回标准错误结构（HTTP 501）。
+> 需配置 `embeddings.provider`。当前支持：`mock` / `deterministic` / `builtin`（三者都走同一套本地确定性实现）。未配置或不支持时返回标准错误结构（HTTP 501）。

 ### `POST /v1/files`

@@ -437,7 +440,7 @@ data: [DONE]
 约束与行为：

 - 请求必须为 `multipart/form-data`，否则返回 `400`。
-- 请求体总大小上限 `100 MiB`（超限返回 `413`）。
+- 请求体总大小上限 **100 MiB**（超限返回 `413`）。
 - 成功返回 OpenAI `file` 对象（`id/object/bytes/filename/purpose/status` 等字段），并附带 `account_id` 便于定位来源账号。

 ---
@@ -494,6 +497,13 @@ anthropic-version: 2023-06-01
 | `stream` | boolean | ❌ | 默认 `false` |
 | `system` | string | ❌ | 可选系统提示 |
 | `tools` | array | ❌ | Claude tool 定义 |
+| `thinking` | object | ❌ | Anthropic thinking 配置；会转译为下游 reasoning 控制，`-nothinking` 模型会忽略 |
+| `temperature` | number | ❌ | 透传到下游；若同时提供 `top_p`，以 `temperature` 为准 |
+| `top_p` | number | ❌ | 当未提供 `temperature` 时透传到下游 |
+| `stop_sequences` | array | ❌ | 透传到下游停用序列 |
+| `tool_choice` | string/object | ❌ | 支持 `auto` / `none` / `required` / `{"type":"function","name":"..."}`，并会转译为下游工具选择 |
+
+> 说明：上述 `thinking`、`temperature`、`top_p`、`stop_sequences`、`tool_choice` 都会走兼容层转译；最终是否生效仍取决于当前模型和上游能力。`temperature` 与 `top_p` 同时存在时，`temperature` 优先。

 #### 非流式响应

@@ -934,12 +944,15 @@ data: {"type":"message_stop"}
  "message": "API 测试成功（仅会话创建）",
  "model": "deepseek-v4-flash",
  "session_count": 0,
-  "config_writable": true
+  "config_writable": true,
+  "config_warning": ""
 }
 ```

 如果传入 `message`，还会附带 `thinking`（当上游返回思考内容时）。

+当部署环境配置文件路径不可写（例如容器内默认 `/app/config.json` 只读）时，登录与会话测试仍可继续；此时会返回 `config_warning` 提示 token 仅保存在内存、重启后丢失。
+
 ### `POST /admin/accounts/test-all`

 可选请求字段：`model`
@@ -1222,7 +1235,7 @@ Gemini 路由使用 Google 风格错误结构：
 | 状态码 | 说明 |
 | --- | --- |
 | `401` | 鉴权失败（key/token 无效，或 Admin JWT 过期） |
-| `429` | 请求过多（超出并发上限 + 等待队列） |
+| `429` | 请求过多（超出并发上限 + 等待队列；当前不附带 `Retry-After` 头） |
 | `503` | 模型不可用或上游服务异常 |

 ---
--- a/12
+++ b/12
@@ -28,6 +28,8 @@ FROM debian:bookworm-slim AS runtime-base
 WORKDIR /app
 RUN apt-get update \
    && apt-get install -y --no-install-recommends ca-certificates \
+    && groupadd -r ds2api && useradd -r -g ds2api -d /app -s /sbin/nologin ds2api \
+    && mkdir -p /app/data /data && chown -R ds2api:ds2api /app /data \
    && rm -rf /var/lib/apt/lists/*
 COPY --from=busybox-tools /bin/busybox /usr/local/bin/busybox
 EXPOSE 5001
@@ -36,8 +38,9 @@ CMD ["/usr/local/bin/ds2api"]
 FROM runtime-base AS runtime-from-source
 COPY --from=go-builder /out/ds2api /usr/local/bin/ds2api

-COPY --from=go-builder /app/config.example.json /app/config.example.json
-COPY --from=webui-builder /app/static/admin /app/static/admin
+COPY --from=go-builder --chown=ds2api:ds2api /app/config.example.json /app/config.example.json
+COPY --from=webui-builder --chown=ds2api:ds2api /app/static/admin /app/static/admin
+USER ds2api

 FROM busybox-tools AS dist-extract
 ARG TARGETARCH
@@ -60,7 +63,8 @@ RUN set -eux; \
 FROM runtime-base AS runtime-from-dist
 COPY --from=dist-extract /out/ds2api /usr/local/bin/ds2api

-COPY --from=dist-extract /out/config.example.json /app/config.example.json
-COPY --from=dist-extract /out/static/admin /app/static/admin
+COPY --from=dist-extract --chown=ds2api:ds2api /out/config.example.json /app/config.example.json
+COPY --from=dist-extract --chown=ds2api:ds2api /out/static/admin /app/static/admin
+USER ds2api

 FROM runtime-from-source AS final
--- a/README.MD
+++ b/README.MD
@@ -17,7 +17,7 @@

 语言 / Language: [中文](README.MD) | [English](README.en.md)

-将 DeepSeek Web 对话能力转换为 OpenAI、Claude 与 Gemini 兼容 API。后端为 **Go 全量实现**，前端为 React WebUI 管理台（源码在 `webui/`，部署时自动构建到 `static/admin`）。
+将 DeepSeek Web 对话能力转换为 OpenAI、Claude 与 Gemini 兼容 API。核心后端以 **Go** 实现，Vercel 流式桥接额外使用少量 Node Runtime，前端为 React WebUI 管理台（源码在 `webui/`，部署时自动构建到 `static/admin`）。

 文档入口：[文档导航](docs/README.md) / [架构说明](docs/ARCHITECTURE.md) / [接口文档](API.md)

@@ -31,6 +31,30 @@
 >
 > 请勿将本项目用于违反服务条款、协议、法律法规或平台规则的场景。商业使用前请自行确认 `LICENSE`、相关协议以及你是否获得了作者的书面许可。

+## 目录
+
+- [架构概览（摘要）](#架构概览摘要)
+- [核心能力](#核心能力)
+- [平台兼容矩阵](#平台兼容矩阵)
+- [模型支持](#模型支持)
+  - [OpenAI 接口](#openai-接口get-v1models)
+  - [Claude 接口](#claude-接口get-anthropicv1models)
+  - [Gemini 接口](#gemini-接口)
+- [快速开始](#快速开始)
+  - [方式一：下载 Release 构建包](#方式一下载-release-构建包)
+  - [方式二：Docker 运行](#方式二docker-运行)
+  - [方式三：Vercel 部署](#方式三vercel-部署)
+  - [方式四：本地源码运行](#方式四本地源码运行)
+- [配置说明](#配置说明)
+- [鉴权模式](#鉴权模式)
+- [并发模型](#并发模型)
+- [Tool Call 适配](#tool-call-适配)
+- [本地开发抓包工具](#本地开发抓包工具)
+- [文档索引](#文档索引)
+- [测试](#测试)
+- [Release 自动构建（GitHub Actions）](#release-自动构建github-actions)
+- [免责声明](#免责声明)
+
 ## 架构概览（摘要）

 ```mermaid
@@ -107,6 +131,8 @@ flowchart LR
 | WebUI 管理台 | `/admin` 单页应用（中英文双语、深色模式，支持查看服务器端对话记录） |
 | 运维探针 | `GET /healthz`（存活）、`GET /readyz`（就绪） |

+OpenAI `/v1/*` 仍是推荐的规范路径；同时支持 `/models`、`/chat/completions`、`/responses`、`/embeddings`、`/files` 等根路径快捷路由，方便只配置 DS2API 根地址的第三方客户端。
+
 ## 平台兼容矩阵

 | 级别 | 平台 | 当前状态 |
@@ -134,10 +160,9 @@ flowchart LR
 | expert | `deepseek-v4-pro-search-nothinking` | 永久关闭，不受请求参数影响 | ✅ |
 | vision | `deepseek-v4-vision` | 默认开启，可由请求参数控制 | ❌ |
 | vision | `deepseek-v4-vision-nothinking` | 永久关闭，不受请求参数影响 | ❌ |
-| vision | `deepseek-v4-vision-search` | 默认开启，可由请求参数控制 | ✅ |
-| vision | `deepseek-v4-vision-search-nothinking` | 永久关闭，不受请求参数影响 | ✅ |

 除原生模型外，也支持常见 alias 输入（如 `gpt-4.1`、`gpt-5`、`gpt-5-codex`、`o3`、`claude-*`、`gemini-*` 等），但 `/v1/models` 返回的是规范化后的 DeepSeek 原生模型 ID。若 alias 名本身追加 `-nothinking` 后缀，也会映射到对应的强制关思考模型。完整 alias 行为以 [API.md](API.md#模型-alias-解析策略) 和 `config.example.json` 为准。
+当前上游视觉模型只暴露 `vision` 通道，不提供独立的联网搜索视觉变体。

 ### Claude 接口（`GET /anthropic/v1/models`）

@@ -221,6 +246,8 @@ docker-compose logs -f
 ```

 默认 `docker-compose.yml` 会把宿主机 `6011` 映射到容器内的 `5001`。如果你希望直接对外暴露 `5001`，请设置 `DS2API_HOST_PORT=5001`（或者手动调整 `ports` 配置）。
+同时默认把 `./config.json` 挂载到容器 `/data/config.json`，并设置 `DS2API_CONFIG_PATH=/data/config.json`，用于避免 `/app` 只读导致运行时 token 持久化失败。
+镜像会预创建 `/data` 并授权给非 root 的 `ds2api` 用户；如果使用单文件 bind mount，请确保宿主机 `config.json` 对容器用户可读写，例如将其属主改为容器用户的 UID/GID，或执行 `chmod 666 config.json`（仅当宿主机文件属主 UID 与容器用户一致时，`chmod 644` 才足够）。

 更新镜像：`docker-compose up -d --build`

@@ -258,7 +285,7 @@ base64 < config.json | tr -d '\n'

 ### 方式四：本地源码运行

-**前置要求**：Go 1.26+，Node.js `20.19+` 或 `22.12+`（仅在需要构建 WebUI 时）
+**前置要求**：Go 1.26+，Node.js `20.19+` 或 `22.12+`（仅在需要构建 WebUI 时）；同时确保 `npm` 可用，建议 `npm 10+`

 ```bash
 # 1. 克隆仓库
@@ -277,7 +304,7 @@ go run ./cmd/ds2api

 服务实际绑定：`0.0.0.0:5001`，因此同一局域网设备通常也可以通过你的内网 IP 访问。

-> **WebUI 自动构建**：本地首次启动时，若 `static/admin` 不存在，会自动尝试执行 `npm ci`（仅在缺少依赖时）和 `npm run build -- --outDir static/admin --emptyOutDir`（需要本机有 Node.js）。你也可以手动构建：`./scripts/build-webui.sh`
+> **WebUI 自动构建**：本地首次启动时，若 `static/admin` 不存在，会自动尝试执行 `npm ci`（仅在缺少依赖时）和 `npm run build -- --outDir static/admin --emptyOutDir`（需要本机有 Node.js 和 npm）。你也可以手动构建：`./scripts/build-webui.sh`

 ## 配置说明

@@ -291,7 +318,7 @@ go run ./cmd/ds2api
 - `runtime`：账号并发、队列与 token 刷新策略，可通过 Admin Settings 热更新。
 - `auto_delete.mode`：请求结束后的远端会话清理策略，支持 `none` / `single` / `all`。
 - `history_split`：旧轮次拆分字段，已废弃并忽略，仅保留兼容旧配置。
-- `current_input_file`：唯一生效的独立拆分策略；默认开启且阈值为 `0`，触发时将完整上下文合并上传为隐藏上下文文件。
+- `current_input_file`：唯一生效的独立拆分策略；默认开启且阈值为 `0`，触发时将完整上下文合并上传为 `DS2API_HISTORY.txt` 上下文文件。
 - 如果关闭 `current_input_file`，请求会直接透传，不上传拆分上下文文件。
 - `thinking_injection`：默认开启；在最新 user 消息末尾追加思考增强提示词，提高高强度推理与工具调用前的思考稳定性；`prompt` 留空时使用内置默认提示词。

@@ -307,6 +334,7 @@ go run ./cmd/ds2api
 | **直通 token 模式** | 传入 token 不在 `config.keys` 中时，直接作为 DeepSeek token 使用 |

 可选请求头 `X-Ds2-Target-Account`：指定使用某个托管账号（值为 email 或 mobile）。
+如果指定账号不存在，或者当前管理账号队列已满，请求会返回 `429`；当前 `429` 不附带 `Retry-After` 头。若账号存在但登录/刷新失败，则返回对应的鉴权错误。
 Gemini 路由还可以使用 `x-goog-api-key`，或在没有认证头时使用 `?key=` / `?api_key=` 作为调用方凭据。

 ## 并发模型
@@ -319,7 +347,7 @@ Gemini 路由还可以使用 `x-goog-api-key`，或在没有认证头时使用 `
 ```

 - 当 in-flight 槽位满时，请求进入等待队列，**不会立即 429**
-- 超出总承载上限后才返回 `429 Too Many Requests`
+- 超出总承载上限后才返回 `429 Too Many Requests`，当前响应不附带 `Retry-After`
 - `GET /admin/queue/status` 返回实时并发状态

 ## Tool Call 适配
@@ -397,10 +425,10 @@ npm run build --prefix webui

 工作流文件：`.github/workflows/release-artifacts.yml`

-- **触发条件**：仅在 GitHub Release `published` 时触发（普通 push 不会触发）
-- **构建产物**：多平台二进制包（`linux/amd64`、`linux/arm64`、`linux/armv7`、`darwin/amd64`、`darwin/arm64`、`windows/amd64`、`windows/arm64`）+ `sha256sums.txt`
+- **触发条件**：默认仅在 GitHub Release `published` 时自动触发；也支持在 Actions 页面手动 `workflow_dispatch`，并填写 `release_tag` 复跑/补发
+- **构建产物**：多平台二进制包（`linux/amd64`、`linux/arm64`、`linux/armv7`、`darwin/amd64`、`darwin/arm64`、`windows/amd64`、`windows/arm64`）、Linux Docker 镜像导出包 + `sha256sums.txt`
 - **容器镜像发布**：仅推送到 GHCR（`ghcr.io/cjackhwang/ds2api`）
-- **每个压缩包包含**：`ds2api` 可执行文件、`static/admin`、WASM 文件（同时支持内置 fallback）、`config.example.json` 配置示例、README、LICENSE
+- **每个二进制压缩包包含**：`ds2api` 可执行文件、`static/admin`、`config.example.json`、`.env.example`、`README.MD`、`README.en.md`、`LICENSE`

 ## 免责声明

--- a/README.en.md
+++ b/README.en.md
@@ -16,7 +16,7 @@

 Language: [中文](README.MD) | [English](README.en.md)

-DS2API converts DeepSeek Web chat capability into OpenAI-compatible, Claude-compatible, and Gemini-compatible APIs. The backend is a **pure Go implementation**, with a React WebUI admin panel (source in `webui/`, build output auto-generated to `static/admin` during deployment).
+DS2API converts DeepSeek Web chat capability into OpenAI-compatible, Claude-compatible, and Gemini-compatible APIs. The core backend is Go-based, with a small Node Runtime bridge used for Vercel streaming, and the React WebUI admin panel lives in `webui/` (build output auto-generated to `static/admin` during deployment).

 Documentation entry: [Docs Index](docs/README.md) / [Architecture](docs/ARCHITECTURE.en.md) / [API Reference](API.en.md)

@@ -28,6 +28,30 @@ Documentation entry: [Docs Index](docs/README.md) / [Architecture](docs/ARCHITEC
 >
 > Do not use this project in ways that violate service terms, agreements, laws, or platform rules. Before any commercial use, review the `LICENSE`, the relevant terms, and confirm that you have the author's written permission.

+## Table of Contents
+
+- [Architecture Overview (Summary)](#architecture-overview-summary)
+- [Key Capabilities](#key-capabilities)
+- [Platform Compatibility Matrix](#platform-compatibility-matrix)
+- [Model Support](#model-support)
+  - [OpenAI Endpoint](#openai-endpoint-get-v1models)
+  - [Claude Endpoint](#claude-endpoint-get-anthropicv1models)
+  - [Gemini Endpoint](#gemini-endpoint)
+- [Quick Start](#quick-start)
+  - [Option 1: Download Release Binaries](#option-1-download-release-binaries)
+  - [Option 2: Docker / GHCR](#option-2-docker--ghcr)
+  - [Option 3: Vercel](#option-3-vercel)
+  - [Option 4: Local Run](#option-4-local-run)
+- [Configuration](#configuration)
+- [Authentication Modes](#authentication-modes)
+- [Concurrency Model](#concurrency-model)
+- [Tool Call Adaptation](#tool-call-adaptation)
+- [Local Dev Packet Capture](#local-dev-packet-capture)
+- [Documentation Index](#documentation-index)
+- [Testing](#testing)
+- [Release Artifact Automation (GitHub Actions)](#release-artifact-automation-github-actions)
+- [Disclaimer](#disclaimer)
+
 ## Architecture Overview (Summary)

 ```mermaid
@@ -104,6 +128,8 @@ For the full module-by-module architecture and directory responsibilities, see [
 | WebUI Admin Panel | SPA at `/admin` (bilingual Chinese/English, dark mode, with server-side conversation history) |
 | Health Probes | `GET /healthz` (liveness), `GET /readyz` (readiness) |

+OpenAI `/v1/*` routes remain canonical, and DS2API also accepts root shortcuts such as `/models`, `/chat/completions`, `/responses`, `/embeddings`, and `/files` for clients configured with the bare service URL.
+
 ## Platform Compatibility Matrix

 | Tier | Platform | Status |
@@ -126,9 +152,9 @@ For the full module-by-module architecture and directory responsibilities, see [
 | default | `deepseek-v4-flash-search` | enabled by default, request-controlled | ✅ |
 | expert | `deepseek-v4-pro-search` | enabled by default, request-controlled | ✅ |
 | vision | `deepseek-v4-vision` | enabled by default, request-controlled | ❌ |
-| vision | `deepseek-v4-vision-search` | enabled by default, request-controlled | ✅ |

 Besides native IDs, DS2API also accepts common aliases as input (for example `gpt-4.1`, `gpt-5`, `gpt-5-codex`, `o3`, `claude-*`, `gemini-*`), but `/v1/models` returns normalized DeepSeek native model IDs. The complete alias behavior is documented in [API.en.md](API.en.md#model-alias-resolution) and `config.example.json`.
+Current upstream vision support exposes only the `vision` lane and does not provide a separate search-enabled vision variant.

 ### Claude Endpoint (`GET /anthropic/v1/models`)

@@ -209,6 +235,7 @@ docker-compose up -d
 ```

 The default `docker-compose.yml` uses `ghcr.io/cjackhwang/ds2api:latest` and maps host port `6011` to container port `5001`. If you want `5001` exposed directly, set `DS2API_HOST_PORT=5001` (or adjust the `ports` mapping).
+It also mounts `./config.json` to `/data/config.json` and sets `DS2API_CONFIG_PATH=/data/config.json` by default, which avoids runtime token persistence failures caused by read-only `/app`.

 Rebuild after updates: `docker-compose up -d --build`

@@ -279,7 +306,7 @@ Common fields:
 - `runtime`: account concurrency, queueing, and token refresh behavior, hot-reloadable via Admin Settings.
 - `auto_delete.mode`: remote session cleanup after each request, supporting `none` / `single` / `all`.
 - `history_split`: legacy multi-turn history split field, now ignored and kept only for backward-compatible config loading.
-- `current_input_file`: the only active split mode; it is enabled by default and uploads the full context as a hidden context file once the character threshold is reached.
+- `current_input_file`: the only active split mode; it is enabled by default and uploads the full context as a `DS2API_HISTORY.txt` context file once the character threshold is reached.
 - If you turn off `current_input_file`, requests pass through directly without uploading any split context file.

 For the full environment variable list, see [docs/DEPLOY.en.md](docs/DEPLOY.en.md). For auth behavior, see [API.en.md](API.en.md#authentication).
@@ -382,10 +409,10 @@ npm run build --prefix webui

 Workflow: `.github/workflows/release-artifacts.yml`

-- **Trigger**: only on GitHub Release `published` (normal pushes do not trigger builds)
-- **Outputs**: multi-platform archives (`linux/amd64`, `linux/arm64`, `linux/armv7`, `darwin/amd64`, `darwin/arm64`, `windows/amd64`, `windows/arm64`) + `sha256sums.txt`
+- **Trigger**: by default only on GitHub Release `published`; you can also run it manually via `workflow_dispatch` and pass `release_tag` to rerun / backfill
+- **Outputs**: multi-platform binary archives (`linux/amd64`, `linux/arm64`, `linux/armv7`, `darwin/amd64`, `darwin/arm64`, `windows/amd64`, `windows/arm64`), Linux Docker image export tarballs, and `sha256sums.txt`
 - **Container publishing**: GHCR only (`ghcr.io/cjackhwang/ds2api`)
-- **Each archive includes**: `ds2api` executable, `static/admin`, WASM file (with embedded fallback support), `config.example.json`-based config template, README, LICENSE
+- **Each binary archive includes**: the `ds2api` executable, `static/admin`, `config.example.json`, `.env.example`, `README.MD`, `README.en.md`, and `LICENSE`

 ## Disclaimer

--- a/2
+++ b/2
@@ -1 +1 @@
-4.1.1
+4.3.0
--- a/cmd/ds2api/main.go
+++ b/cmd/ds2api/main.go
@@ -35,8 +35,9 @@ func main() {
 	}

 	srv := &http.Server{
-		Addr:    "0.0.0.0:" + port,
-		Handler: app.Router,
+		Addr:              "0.0.0.0:" + port,
+		Handler:           app.Router,
+		ReadHeaderTimeout: 5 * time.Second,
 	}
 	localURL := fmt.Sprintf("http://127.0.0.1:%s", port)
 	lanIP := detectLANIPv4()
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -9,8 +9,9 @@ services:
      # Host port is configurable via DS2API_HOST_PORT; container port stays fixed at 5001.
      - "${DS2API_HOST_PORT:-6011}:5001"
    volumes:
-      - ./config.json:/app/config.json    # 配置文件
+      - ./config.json:/data/config.json   # 配置文件（持久化推荐路径）
    environment:
      - TZ=Asia/Shanghai
      - LOG_LEVEL=INFO
      - DS2API_ADMIN_KEY=${DS2API_ADMIN_KEY:-ds2api}
+      - DS2API_CONFIG_PATH=/data/config.json
--- a/docs/CONTRIBUTING.en.md
+++ b/docs/CONTRIBUTING.en.md
@@ -36,7 +36,7 @@ go run ./cmd/ds2api
 cd webui

 # 2. Install dependencies
-npm install
+npm ci

 # 3. Start dev server (hot reload)
 npm run dev
--- a/docs/CONTRIBUTING.md
+++ b/docs/CONTRIBUTING.md
@@ -36,7 +36,7 @@ go run ./cmd/ds2api
 cd webui

 # 2. 安装依赖
-npm install
+npm ci

 # 3. 启动开发服务器（热更新）
 npm run dev
--- a/docs/DEPLOY.en.md
+++ b/docs/DEPLOY.en.md
@@ -64,8 +64,8 @@ Use `config.json` as the single source of truth:

 Built-in GitHub Actions workflow: `.github/workflows/release-artifacts.yml`

-- **Trigger**: only on Release `published` (no build on normal push)
-- **Outputs**: multi-platform binary archives + `sha256sums.txt`
+- **Trigger**: by default only on Release `published`; you can also run it manually via `workflow_dispatch` and pass `release_tag` to rerun / backfill
+- **Outputs**: multi-platform binary archives, Linux Docker image export tarballs, and `sha256sums.txt`
 - **Container publishing**: GHCR only (`ghcr.io/cjackhwang/ds2api`)

 | Platform | Architecture | Format |
@@ -130,6 +130,9 @@ docker-compose logs -f
 ```

 The default `docker-compose.yml` directly uses `ghcr.io/cjackhwang/ds2api:latest` and maps host port `6011` to container port `5001`. If you want `5001` exposed directly, set `DS2API_HOST_PORT=5001` (or adjust the `ports` mapping).
+The compose template also defaults to `DS2API_CONFIG_PATH=/data/config.json` with `./config.json:/data/config.json` mounted, so deployments avoid read-only `/app` persistence issues by default.
+The image pre-creates `/data` and grants it to the non-root `ds2api` user. If you bind-mount a single host file, make sure `config.json` is readable and writable by the container user, for example by changing its owner to the container user's UID/GID or with `chmod 666 config.json` (`chmod 644` is enough only when the host file's owner UID matches the container user); otherwise Linux UID/GID mismatches can still cause `open /data/config.json: permission denied`.
+Compatibility note: when `DS2API_CONFIG_PATH` is unset and the runtime base directory is `/app`, newer versions prefer `/data/config.json`; if that file is missing but the legacy `/app/config.json` exists, DS2API automatically falls back to the legacy path to avoid losing config after an upgrade.
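The lookup order in the compatibility note can be sketched as a small pure function. This is an assumption-laden illustration rather than the project's resolver: `resolveConfigPath` and `fileExists` are hypothetical names, and the behavior for base directories other than `/app` (joining `config.json` onto the base directory) is a guess that the note does not cover.

```go
package main

import (
	"fmt"
	"os"
	"path"
)

// resolveConfigPath applies the documented order: an explicit env value wins;
// under /app the new /data/config.json is preferred; the legacy
// /app/config.json is used only when it exists and the new file does not.
func resolveConfigPath(envPath, baseDir string, exists func(string) bool) string {
	if envPath != "" {
		return envPath
	}
	if baseDir != "/app" {
		return path.Join(baseDir, "config.json") // assumption, not from the docs
	}
	const preferred, legacy = "/data/config.json", "/app/config.json"
	if !exists(preferred) && exists(legacy) {
		return legacy
	}
	return preferred
}

// fileExists reports whether a path can be stat'ed.
func fileExists(p string) bool {
	_, err := os.Stat(p)
	return err == nil
}

func main() {
	fmt.Println(resolveConfigPath(os.Getenv("DS2API_CONFIG_PATH"), "/app", fileExists))
}
```

Passing the existence check in as a function keeps the decision logic testable without touching the real filesystem.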

 If you want a pinned version instead of `latest`, you can also pull a specific tag directly:

@@ -195,6 +198,11 @@ Notes:

 - **Port**: DS2API listens on `5001` by default; the template sets `PORT=5001`.
 - **Persistent config**: the template mounts `/data` and sets `DS2API_CONFIG_PATH=/data/config.json`. After importing config in Admin UI, it will be written and persisted to this path.
+- **`open /app/config.json: permission denied`**: this means the instance is trying to persist runtime tokens to a read-only path (commonly `/app` inside the image).  
+  Recommended handling:
+  1. Set a writable path explicitly: `DS2API_CONFIG_PATH=/data/config.json` (and mount a persistent volume at `/data`);
+  2. If you bootstrap with `DS2API_CONFIG_JSON` and do not need runtime writeback, keep env-backed mode (`DS2API_ENV_WRITEBACK` disabled);
+  3. In current versions, login/session tests continue even if persistence fails; the Admin API returns a warning that token persistence failed and that the token is held in memory only until restart.
 - **Build version**: Zeabur / regular `docker build` does not require `BUILD_VERSION` by default. The image prefers that build arg when provided, and automatically falls back to the repo-root `VERSION` file when it is absent.
 - **First login**: after deployment, open `/admin` and login with `DS2API_ADMIN_KEY` shown in Zeabur env/template instructions (recommended: rotate to a strong secret after first login).

@@ -263,6 +271,7 @@ VERCEL_TEAM_ID=team_xxxxxxxxxxxx   # optional for personal accounts
 | `VERCEL_TOKEN` | Vercel sync token | — |
 | `VERCEL_PROJECT_ID` | Vercel project ID | — |
 | `VERCEL_TEAM_ID` | Vercel team ID | — |
+| `DS2API_CHAT_HISTORY_PATH` | Chat history storage path (must be set to `/tmp/chat_history.json` on Vercel, otherwise unavailable due to read-only filesystem) | `data/chat_history.json` |
 | `DS2API_VERCEL_PROTECTION_BYPASS` | Deployment protection bypass for internal Node→Go calls | — |

 ### 3.4 Vercel Architecture
@@ -352,6 +361,22 @@ If API responses return Vercel HTML `Authentication Required`:
 - **Option B**: Add `x-vercel-protection-bypass` header to requests
 - **Option C**: Set `VERCEL_AUTOMATION_BYPASS_SECRET` (or `DS2API_VERCEL_PROTECTION_BYPASS`) for internal Node→Go calls

+#### Chat History Unavailable (read-only file system)
+
+```text
+create chat history dir: mkdir /var/task/data: read-only file system
+```
+
+**Cause**: Vercel Serverless functions have a read-only filesystem (`/var/task`). Chat history fails because it cannot create directories there.
+
+**Fix**: Add the following in Vercel Project Settings → Environment Variables:
+
+```text
+DS2API_CHAT_HISTORY_PATH=/tmp/chat_history.json
+```
+
+`/tmp` is the only writable directory in Vercel Serverless. Data is ephemeral (not persisted across cold starts), but the feature works within a single instance lifetime.
+
 ### 3.6 Build Artifacts Not Committed

 - `static/admin` directory is not in Git
@@ -394,7 +419,7 @@ Or step by step:

 ```bash
 cd webui
-npm install
+npm ci
 npm run build
 # Output goes to static/admin/
 ```
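The compatibility note for `DS2API_CONFIG_PATH` earlier in this file's diff describes a three-step lookup order. A minimal Go sketch of that order follows, assuming a hypothetical `resolveConfigPath` helper and an injected `exists` probe; neither name comes from the DS2API source, and the branch for base directories other than `/app` is an assumption made only to keep the example total.

```go
package main

import (
	"fmt"
	"path/filepath"
)

const (
	preferredPath = "/data/config.json" // writable location preferred by newer versions
	legacyPath    = "/app/config.json"  // location used by older deployments
)

// resolveConfigPath is a hypothetical helper mirroring the documented order:
// explicit env value, then /data/config.json, then the legacy /app fallback.
func resolveConfigPath(envPath, baseDir string, exists func(string) bool) string {
	if envPath != "" {
		return envPath // DS2API_CONFIG_PATH always wins when set
	}
	if baseDir != "/app" {
		// Assumption: outside the image layout, keep config next to the base dir.
		return filepath.Join(baseDir, "config.json")
	}
	if !exists(preferredPath) && exists(legacyPath) {
		return legacyPath // avoid "losing" config after an upgrade
	}
	return preferredPath
}

func main() {
	onlyLegacy := func(p string) bool { return p == legacyPath }
	fmt.Println(resolveConfigPath("", "/app", onlyLegacy))
}
```

Passing the existence check as a function keeps the sketch testable without touching the real filesystem.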
--- a/docs/DEPLOY.md
+++ b/docs/DEPLOY.md
@@ -64,8 +64,8 @@ cp config.example.json config.json

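 The repository ships a built-in GitHub Actions workflow: `.github/workflows/release-artifacts.yml`

-- **Trigger**: only on Release `published` (ordinary pushes do not trigger a build)
-- **Outputs**: multi-platform binary archives + `sha256sums.txt`
+- **Trigger**: by default only on Release `published`; a manual `workflow_dispatch` from the Actions page is also supported, with `release_tag` filled in to rerun or backfill a release
+- **Outputs**: multi-platform binary archives, a Linux Docker image export archive + `sha256sums.txt`
 - **Container publishing**: GHCR only (`ghcr.io/cjackhwang/ds2api`)

 | Platform | Architecture | Format |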
@@ -130,6 +130,9 @@ docker-compose logs -f
 ```

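 The default `docker-compose.yml` directly uses `ghcr.io/cjackhwang/ds2api:latest` and maps host port `6011` to container port `5001`. If you want `5001` exposed directly, set `DS2API_HOST_PORT=5001` (or adjust the `ports` mapping manually).
+The compose template also defaults to `DS2API_CONFIG_PATH=/data/config.json` with `./config.json:/data/config.json` mounted, so deployments avoid read-only `/app` persistence issues by default.
+The image pre-creates `/data` and grants it to the non-root `ds2api` user. If you bind-mount a single host file, make sure the host `config.json` is readable by the container user and, when runtime writeback is needed, writable as well. Note that `chmod 644 config.json` grants only read access to non-owners; otherwise Linux UID/GID mismatches can still cause `open /data/config.json: permission denied`.
+Compatibility note: when `DS2API_CONFIG_PATH` is unset and the runtime directory is `/app`, newer versions prefer `/data/config.json`; if that file is missing but a legacy `/app/config.json` is detected, DS2API automatically falls back to reading the legacy path to avoid apparent config loss after an upgrade.

 If you want a pinned version, you can also pull a specific tag directly: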

@@ -195,6 +198,11 @@ healthcheck:

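 - **Port**: the service listens on `5001` by default; the template pins `PORT=5001`.
 - **Persistent config**: the template mounts a volume at `/data` and sets `DS2API_CONFIG_PATH=/data/config.json`; after you import config in the Admin UI, it is written and persisted to this path.
+- **`open /app/config.json: permission denied`**: the instance is trying to persist runtime tokens to a read-only path (commonly `/app` inside the image).  
+  Recommended handling:
+  1. Set a writable path explicitly: `DS2API_CONFIG_PATH=/data/config.json` (and mount a persistent volume at `/data`);
+  2. If you start with `DS2API_CONFIG_JSON` and do not need runtime writeback, keep env-backed mode (`DS2API_ENV_WRITEBACK` disabled);
+  3. In the latest versions, login/session tests continue even if persistence fails; only a notice that the token was not persisted (and is lost after restart) is shown.
 - **Build version**: Zeabur / regular `docker build` does not require `BUILD_VERSION` by default; the image prefers that build arg when provided and automatically falls back to the repo-root `VERSION` file when it is absent.
 - **First login**: after deployment, open `/admin` and log in with the `DS2API_ADMIN_KEY` shown in the Zeabur env/template instructions (rotating to a strong secret after first login is recommended).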

@@ -263,6 +271,7 @@ VERCEL_TEAM_ID=team_xxxxxxxxxxxx   # 个人账号可留空
 | `VERCEL_TOKEN` | Vercel sync token | — |
 | `VERCEL_PROJECT_ID` | Vercel project ID | — |
 | `VERCEL_TEAM_ID` | Vercel team ID | — |
+| `DS2API_CHAT_HISTORY_PATH` | Chat history storage path (must be set to `/tmp/chat_history.json` on Vercel, otherwise unavailable due to the read-only filesystem) | `data/chat_history.json` |
 | `DS2API_VERCEL_PROTECTION_BYPASS` | Deployment protection bypass secret (internal Node→Go calls) | — |

 ### 3.3 Runtime Behavior Settings (via Admin API)
@@ -362,6 +371,22 @@ No Output Directory named "public" found after the Build completed.
 - **Option B**: Add the `x-vercel-protection-bypass` header to requests
 - **Option C**: Set `VERCEL_AUTOMATION_BYPASS_SECRET` (or `DS2API_VERCEL_PROTECTION_BYPASS`); this affects only internal Node→Go calls

+#### Chat History Unavailable (read-only file system)
+
+```text
+create chat history dir: mkdir /var/task/data: read-only file system
+```
+
+**Cause**: the filesystem of Vercel Serverless functions (`/var/task`) is read-only, so chat history fails when it tries to create a directory under that path.
+
+**Fix**: Add the following in Vercel Project Settings → Environment Variables:
+
+```text
+DS2API_CHAT_HISTORY_PATH=/tmp/chat_history.json
+```
+
+`/tmp` is the only writable directory in the Vercel Serverless environment. Data is ephemeral (not persisted across function cold starts), but the feature works within a single instance lifetime.
+
 ### 3.6 Build Artifacts Not Committed

 - The `static/admin` directory is not in Git
@@ -404,7 +429,7 @@ go run ./cmd/ds2api

 ```bash
 cd webui
-npm install
+npm ci
 npm run build
 # Output goes to static/admin/
 ```
--- a/docs/DeepSeekSSE行为结构说明-2026-04-05.md
+++ b/docs/DeepSeekSSE行为结构说明-2026-04-05.md
@@ -309,7 +309,18 @@ parse SSE block
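 - New models may add new `p` paths.
 - New versions may add new fragment.type values.
 - The final-state template content for `CONTENT_FILTER` may change.
-- Auto-continue states (such as `INCOMPLETE` / `AUTO_CONTINUE`) currently come mainly from live testing and implementation-side compatibility logic; the field shapes may still change.
+- Auto-continue states (such as `INCOMPLETE` / `AUTO_CONTINUE`) currently come mainly from live testing and implementation-side compatibility logic; the field shapes may still change. The current implementation does not auto-continue merely because of an early `WIP` status; only an explicit `INCOMPLETE` or `auto_continue` signal triggers a continue.
 - Parsers should tolerate unknown fields, unknown paths, and unknown events.

 If you plan to use this document for real development, keep the raw stream samples, replay scripts, and regression tests alongside it rather than relying on this text alone.
+
+## 2026-04-29 Incremental observations from recent production samples
+
+Based on two real-account long-text samples, `longtext-deepseek-v4-flash-20260429` and `longtext-deepseek-v4-pro-20260429`, the recent format changes are:
+
+1. `data:` events still frequently carry path-less increments of the form `{"v":"..."}` (`p` missing); the parser must treat an empty path as a candidate for visible body text rather than relying only on `response/content`.
+2. Object-shaped `v` values (such as `{"text":"..."}` / `{"content":"..."}`) still appear and may be mixed with path-less chunks; handling `v` only as a string drops body chunks.
+3. In multi-round continuation scenarios, later chunks may no longer repeat an explicit `status`, so the state machine must retain the previous round's `INCOMPLETE` semantics until a terminal state appears.
+4. From 2026-04-29 the client header version baseline is raised to `x-client-version: 2.0.3`; without it some accounts see inconsistent upstream behavior (including empty output and continuation-round anomalies).
+
+Recommendation: default replay for newly added samples should first cover the "long text + multi-round + path-less chunk" combination, so that short-only samples do not let regressions slip through.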
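The first two observations in the section above (path-less chunks and object-shaped `v`) can be sketched as a small Go extractor. This is an illustrative sketch, not the project's parser: the `chunk` type and `visibleText` function are hypothetical names, and only the `p` / `v` field names, the `response/content` path, and the `text` / `content` object keys are taken from the notes.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// chunk models one decoded SSE data payload; V stays raw because it may be
// either a JSON string or a JSON object.
type chunk struct {
	P string          `json:"p"`
	V json.RawMessage `json:"v"`
}

// visibleText returns the body text carried by a chunk, if any.
func visibleText(data []byte) (string, bool) {
	var c chunk
	if err := json.Unmarshal(data, &c); err != nil {
		return "", false
	}
	// An empty path is a visible-text candidate, same as response/content.
	if c.P != "" && c.P != "response/content" {
		return "", false
	}
	if len(c.V) == 0 || string(c.V) == "null" {
		return "", false
	}
	var s string
	if json.Unmarshal(c.V, &s) == nil {
		return s, true // plain string form
	}
	var obj map[string]json.RawMessage
	if json.Unmarshal(c.V, &obj) != nil {
		return "", false
	}
	for _, key := range []string{"text", "content"} {
		if raw, ok := obj[key]; ok && json.Unmarshal(raw, &s) == nil {
			return s, true // object form
		}
	}
	return "", false
}

func main() {
	for _, line := range []string{`{"v":"hello"}`, `{"v":{"text":"world"}}`} {
		text, ok := visibleText([]byte(line))
		fmt.Println(text, ok)
	}
}
```

Keeping `v` as `json.RawMessage` defers the string-versus-object decision until the path has been checked, so neither shape is dropped.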
--- a/docs/README.md
+++ b/docs/README.md
@@ -22,7 +22,7 @@

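 ### Documentation maintenance conventions

-- Documentation updates must be grounded in the actual implementation: root routing lives in `internal/server/router.go`, protocol/resource routes live in `internal/httpapi/*/**/routes.go` and `internal/httpapi/admin/handler.go`, config defaults in `internal/config/*`, models/aliases in `internal/config/models.go`, and the prompt compatibility pipeline in the code entrypoints listed by `docs/prompt-compatibility.md`.
+- Documentation updates must be grounded in the actual implementation: root routing lives in `internal/server/router.go`, protocol/resource routes live in `internal/httpapi/**/handler*.go` and `internal/httpapi/admin/handler.go`, config defaults in `internal/config/*`, models/aliases in `internal/config/models.go`, and the prompt compatibility pipeline in the code entrypoints listed by `docs/prompt-compatibility.md`.
 - `README.MD` / `README.en.md`: for first-time users; keep to "what it is + how to get it running quickly".
 - `docs/ARCHITECTURE*.md`: for developers; the central place for project structure, module responsibilities, and call chains.
 - `API*.md`: for client integrators; focused on endpoint behavior, authentication, and examples.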
@@ -53,7 +53,7 @@ Recommended reading order:

 ### Maintenance conventions

-- Documentation updates must be grounded in the actual implementation: root routing lives in `internal/server/router.go`, protocol/resource routes live in `internal/httpapi/*/**/routes.go` and `internal/httpapi/admin/handler.go`, config defaults in `internal/config/*`, models/aliases in `internal/config/models.go`, and the prompt compatibility pipeline in the code entrypoints listed by `docs/prompt-compatibility.md`.
+- Documentation updates must be grounded in the actual implementation: root routing lives in `internal/server/router.go`, protocol/resource routes live in `internal/httpapi/**/handler*.go` and `internal/httpapi/admin/handler.go`, config defaults in `internal/config/*`, models/aliases in `internal/config/models.go`, and the prompt compatibility pipeline in the code entrypoints listed by `docs/prompt-compatibility.md`.
 - `README.MD` / `README.en.md`: onboarding-oriented (“what + quick start”).
 - `docs/ARCHITECTURE*.md`: developer-oriented source of truth for module boundaries and execution flow.
 - `API*.md`: integration-oriented behavior/contracts.
--- a/docs/TESTING.md
+++ b/docs/TESTING.md
@@ -60,11 +60,10 @@ npm run build --prefix webui
 ./tests/scripts/check-refactor-line-gate.sh
 ./tests/scripts/check-node-split-syntax.sh
 ./tests/scripts/check-cross-build.sh
-
-# Historical stage gate: stage 6 manual smoke sign-off check (reads plans/stage6-manual-smoke.md by default)
-./tests/scripts/check-stage6-manual-smoke.sh
 ```

+Note: `plans/stage6-manual-smoke.md` has been removed; the stage 6 manual smoke test is no longer a CI or release gate.
+
 ### 端到端测试 | End-to-End Tests

 ```bash
--- a/docs/prompt-compatibility.md
+++ b/docs/prompt-compatibility.md
@@ -98,13 +98,16 @@ DS2API 当前的核心思路，不是把客户端传来的 `messages`、`tools`
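 - `prompt` is the primary carrier of conversation context.
 - `ref_file_ids` carries file references only, not ordinary text messages.
 - `tools` is not passed downstream as a "native tool schema"; it is rewritten into `prompt`.
+- The `prompt_tokens` / `input_tokens` / `promptTokenCount` returned to clients is no longer approximated from the "last message" or a rough character estimate; it is a tokenizer count over the **full context prompt**. To avoid the case where the context actually exceeds the limit while the client believes there is still room, the request-side context token count is additionally rounded up slightly, preferring a slight overestimate to an underestimate.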
 - 当前 `/v1/chat/completions` 业务路径仍是“每次请求新建一个远端 `chat_session_id`，并默认发送 `parent_message_id: null`”；因此 DS2API 对外默认表现为“新会话 + prompt 拼历史”，而不是复用 DeepSeek 原生会话树。
 - 但 DeepSeek 远端本身支持同一 `chat_session_id` 的跨轮次持续对话。2026-04-27 已用项目内现有 DeepSeek client 做过一次不改业务代码的双轮实测：同一 `chat_session_id` 下，第 1 轮返回 `request_message_id=1` / `response_message_id=2` / 文本 `SESSION_TEST_ONE`；第 2 轮重新获取一次 PoW，并发送 `parent_message_id=2` 后，成功返回 `request_message_id=3` / `response_message_id=4` / 文本 `SESSION_TEST_TWO`。这说明“同远端会话持续聊天”能力存在，且每轮需要携带正确的 parent/message 链接信息，同时重新获取对应轮次可用的 PoW。
 - OpenAI Chat / Responses 原生走统一 OpenAI 标准化与 DeepSeek payload 组装；Claude / Gemini 会尽量复用 OpenAI prompt/tool 语义，其中 Gemini 直接复用 `promptcompat.BuildOpenAIPromptForAdapter`，Claude 消息接口在可代理场景会转换为 OpenAI chat 形态再执行。
 - 客户端传入的 thinking / reasoning 开关会被归一到下游 `thinking_enabled`。Gemini `generationConfig.thinkingConfig.thinkingBudget` 会翻译成同一套 thinking 开关；关闭时即使上游返回 `response/thinking_content`，兼容层也不会把它当作可见正文输出。若最终解析出的模型名带 `-nothinking` 后缀，则会无条件强制关闭 thinking，优先级高于请求体中的 `thinking` / `reasoning` / `reasoning_effort`。Claude surface 在流式请求且未显式声明 `thinking` 时，仍按 Anthropic 语义默认关闭；但在非流式代理场景，兼容层会内部开启一次下游 thinking，用于捕获“正文为空、工具调用落在 thinking 里”的情况，随后在回包前剥离用户不可见的 thinking block。
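-- For non-streaming finalization of OpenAI Chat / Responses, if the final visible body is empty, the compatibility layer first tries to parse standalone DSML / XML tool blocks in the chain of thought as real tool calls. The streaming path performs the same fallback detection at finalization, but does not intercept or rewrite streaming output midway because of chain-of-thought content; thinking / reasoning increments are still sent as-is first, and the final tool-call result may be emitted only at finalization. The emitted result is returned as this turn's structured assistant `tool_calls` / `function_call` output rather than being stuffed into `content` text; if the client has not enabled thinking / reasoning, the chain of thought is used only for detection and is not exposed as `reasoning_content` or visible body text. Only when the body is empty and the chain of thought contains no executable tool call does handling continue as an empty-reply error.
+- For non-streaming finalization of OpenAI Chat / Responses, if the final visible body is empty, the compatibility layer first tries to parse standalone DSML / XML tool blocks in the chain of thought as real tool calls. The streaming path performs the same fallback detection at finalization, but does not intercept or rewrite streaming output midway because of chain-of-thought content; real tool recognition is always based on the raw upstream text rather than the version already sanitized for visible output, so even though the final visible layer strips complete leaked DSML / XML `tool_calls` wrappers and suppresses wrapper blocks with all-empty parameters or invalid structure, this does not affect converting real tool calls into structured `tool_calls` / `function_call`. The emitted result is returned as this turn's structured assistant `tool_calls` / `function_call` output rather than being stuffed into `content` text; if the client has not enabled thinking / reasoning, the chain of thought is used only for detection and is not exposed as `reasoning_content` or visible body text. Only when the body is empty and the chain of thought contains no executable tool call does handling continue as an empty-reply error.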
 - OpenAI Chat / Responses 的空回复错误处理之前会默认做一次内部补偿重试：第一次上游完整结束后，如果最终可见正文为空、没有解析到工具调用、也没有已经向客户端流式发出工具调用，并且终止原因不是 `content_filter`，兼容层会复用同一个 `chat_session_id`、账号、token 与工具策略，把原始 completion `prompt` 追加固定后缀 `Previous reply had no visible output. Please regenerate the visible final answer or tool call now.` 后重新提交一次。重试遵循 DeepSeek 多轮对话协议：从第一次上游 SSE 流中提取 `response_message_id`，并在重试 payload 中设置 `parent_message_id` 为该值，使重试成为同一会话的后续轮次而非断裂的根消息；同时重新获取一次 PoW（若 PoW 获取失败则回退到原始 PoW）。该重试不会重新标准化消息、不会新建 session、不会切换账号，也不会向流式客户端插入重试标记；第二次 thinking / reasoning 会按正常增量直接接到第一次之后，并继续使用 overlap trim 去重。若第二次仍为空，终端错误码仍保持现有 `upstream_empty_output`；若任一尝试触发空 `content_filter`，不做补偿重试并保持 `content_filter` 错误。JS Vercel 运行时同样设置 `parent_message_id`，但因无法直接调用 PoW API 而复用原始 PoW。

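+- During final visible-body rendering, OpenAI Chat / Responses replaces `[citation:N]` / `[reference:N]` markers in DeepSeek search results with the corresponding Markdown links. `citation` markers are resolved as one-based indices; `reference` markers are mapped as zero-based only when `[reference:0]` (whitespace after the colon is allowed) appears in the same body segment, and this does not affect `citation` markers in that segment.
+
 ## 5. How the prompt is assembled

 After normalization and before current input file handling, OpenAI Chat / Responses applies the `thinking_injection` enhancement by default. Following the DeepSeek V4 practice that "control instructions are more stable at the end of the user message", it appends a thinking-enhancement prompt after the latest user message. The built-in default prompt begins with `Reasoning Effort: Absolute maximum with no shortcuts permitted.` and goes on to require the model to decompose the problem thoroughly, cover potential paths and edge cases, and write out the full reasoning explicitly. The switch is enabled by default and can be turned off with `thinking_injection.enabled=false`; the prompt can also be customized via `thinking_injection.prompt`, with the built-in default used when it is left empty.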
@@ -114,6 +117,11 @@ OpenAI Chat / Responses 在标准化后、current input file 之前，会默认
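 - For ordinary requests it appears directly at the end of the latest user block in the final `prompt`.
 - If current input file is triggered, it goes into the full-context file.

+In addition, `MessagesPrepareWithThinking` prepends a fixed system-level "Output integrity guard" at the very start of the final prompt:
+
+- If the upstream context, tool output, or parsed text contains garbled, corrupted, partially parsed, duplicated, or otherwise malformed fragments, do not imitate or echo them; output only the correct content for the user.
+- This guard sits before the ordinary system / tool prompt, so it is the highest-priority leading instruction in the current final prompt.
+
 ### 5.1 Role markers

 The final prompt uses DeepSeek-style role markers: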
@@ -152,9 +160,13 @@ OpenAI Chat / Responses 在标准化后、current input file 之前，会默认

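 Positive tool-call examples now demonstrate the official DSML style first: `<|DSML|tool_calls>` → `<|DSML|invoke name="...">` → `<|DSML|parameter name="...">`.
 The compatibility layer still accepts the legacy plain `<tool_calls>` wrapper, but the prompt asks the model to prefer official DSML tags and stresses that it must not emit only the closing wrapper while omitting the opening tag. Note that this is "a DSML-compatible shell whose inner parsing still follows XML semantics", not a native end-to-end DSML implementation; DSML tags are normalized back to the existing XML tags at the parser entry and then go through the same parser.
-Array parameters are expressed with `<item>...</item>` child nodes; when a parameter body contains only item children, the Go / Node parsers restore it to an array, so that parameters whose schema requires an array, such as `questions` / `options`, are not misparsed as a `{ "item": ... }` object. If the model mistakenly wraps a complete structured XML fragment in CDATA, the compatibility layer tries to restore CDATA XML fragments in non-verbatim fields into an object / array, while protecting verbatim fields such as `content` / `command`.
+Array parameters are expressed with `<item>...</item>` child nodes; when a parameter body contains only item children, the Go / Node parsers restore it to an array, so that parameters whose schema requires an array, such as `questions` / `options`, are not misparsed as a `{ "item": ... }` object. Beyond that, the parser also recovers some looser list forms, such as a JSON array literal or a comma-separated sequence of JSON items, as long as they are unambiguous; `<item>` remains the preferred form. If the model mistakenly wraps a complete structured XML fragment in CDATA, the compatibility layer tries to restore CDATA XML fragments in non-verbatim fields into an object / array, while protecting verbatim fields such as `content` / `command`. However, if the CDATA is just a single flat XML/HTML tag, such as the inline markup `<b>urgent</b>`, the compatibility layer keeps the original string and does not force it into an object / array; only CDATA fragments that clearly express structure, such as multiple sibling nodes, nested children, or an `item` list, trigger structured recovery.
+On the Go side, reading DeepSeek SSE no longer depends on `bufio.Scanner` with its fixed 2MiB single-line limit; when a file-writing tool returns a very long `content` in a single `data:` line, non-streaming collection, streaming parsing, and auto-continue passthrough all keep the complete line before entering the same tool parsing and serialization flow.
+At the assistant final-response stage, if a tool parameter is explicitly declared as `string` in the schema, then before re-serializing the parsed `tool_calls` / `function_call` into the parameters visible to OpenAI / Responses / Claude, the compatibility layer recursively converts any number / bool / object / array on that path to a string; objects / arrays are collapsed into compact JSON strings. This protection applies only to paths the schema explicitly declares as string and does not rewrite parameters that are genuinely `number` / `boolean` / `object` / `array`. This accommodates cases where DeepSeek emits a structured fragment but the client-side tool schema strictly requires a string parameter (for example `content`, `prompt`, `path`, `taskId`).
+The authoritative source of a tool schema is always **the schema actually carried by the current request**, not the default impression of a same-named tool in other runtimes (Claude Code / OpenCode / Codex, etc.). The compatibility layer now accepts OpenAI-style `function.parameters`, `parameters` / `input_schema` directly on the tool object, and camelCase `inputSchema` / `schema`, and at final output uses this in-request schema to decide whether to keep an array/object or to stringify only paths explicitly declared as `string`. The same rule applies to Claude streaming finalization and the Vercel Node streaming tool-call formatter, preventing parameter type drift for same-named tools across runtimes with different schema shapes.
 Tool names in positive examples come only from tools actually declared in the current request; if the request does not declare enough known tool shapes, the corresponding single-tool, multi-tool, or nested example is omitted, so that unavailable tool names are not written into the prompt.
 For execution tools, the script content must go into the execution parameter itself: `Bash` / `execute_command` use `command`, and `exec_command` uses `cmd`; do not demonstrate scripts as `path` / `content` file-write parameters.
+If the current request declares a read tool such as `Read` / `read_file`, the compatibility layer additionally injects a read-tool cache guard: when a read result only indicates "file unchanged / already in history / please refer to earlier context / no body content", the model must treat the content as unavailable and must not keep calling the same body-less read; it should instead request a full-content read capability or tell the user that the file content needs to be provided again. This guard only mitigates infinite loops caused by client caches returning empty content; DS2API does not and cannot recover the client's local file content out of thin air.

 OpenAI path implementation: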
 [internal/promptcompat/tool_prompt.go](../internal/promptcompat/tool_prompt.go)
@@ -231,6 +243,14 @@ OpenAI 文件相关实现：
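 - File ID collection:
  [internal/promptcompat/file_refs.go](../internal/promptcompat/file_refs.go)

+OpenAI file upload is no longer a generic "file body only" path: it first resolves the DeepSeek upload type from the request's `model` and passes it through to the upload endpoint as `x-model-type`. The currently visible upload types are `default` / `expert` / `vision`; a vision request that uploads images must carry `vision`, otherwise the downstream tends to fall back to text-only or OCR semantics. This model type is used for:
+
+- Standalone file upload endpoints such as `/v1/files`
+- Inline image and attachment uploads in Chat / Responses
+- The `DS2API_HISTORY.txt` context file generated when current input file is triggered
+
+In other words, file uploads and completion requests now use a consistent `model_type`: the completion payload still carries `model_type`, and file uploads carry the same model type information during the DeepSeek upload stage.
+
 Conclusion:

 - The "systemprompt text" lives in the prompt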
@@ -242,9 +262,10 @@ OpenAI 文件相关实现：

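 The compatibility layer now keeps only `current_input_file` as a splitting method; the old `history_split` is deprecated, retained only as a field for compatibility with old configs, and no longer participates in request handling.

-- `current_input_file` is enabled by default; it merges the "full context" into a hidden context file. When the plain-text length of the latest user turn reaches `current_input_file.min_chars` (default `0`), the compatibility layer uploads a context file named `IGNORE.txt` and prefixes its content with an explicit `context note` telling the model that this is compacted history rather than a new instruction; the live prompt also states explicitly that it is in compacted-context mode and asks the model to use the provided history to restore context state and answer the latest request directly, so that repeated tool calls or repeated questions are not treated as a fresh start.
+- `current_input_file` is enabled by default; it merges the "full context" into the `DS2API_HISTORY.txt` context file. When the plain-text length of the latest user turn reaches `current_input_file.min_chars` (default `0`), the compatibility layer uploads a context file named `DS2API_HISTORY.txt`. The file content is first normalized as OpenAI messages and then serialized into a turn-numbered `DS2API_HISTORY.txt`-style transcript, with a `# DS2API_HISTORY.txt` title and `=== N. ROLE ===` sections; the live prompt carries a continuation-toned user message that guides the model to proceed from the latest state in `DS2API_HISTORY.txt` and answer the latest request directly, rather than pulling the task back to the start.
 - If `current_input_file.enabled=false`, the request is passed through directly and no split context file is uploaded.
 - The old `history_split.enabled` / `history_split.trigger_after_turns` are still read into the config object for compatibility, but they neither trigger split uploads nor affect the default-on state of `current_input_file`.
+- Even when the live prompt is shortened after `current_input_file` triggers, the context token count in the client response is still computed with **the semantics of the full pre-split prompt**, not the shortened placeholder prompt; otherwise the real context would be significantly undercounted.

 Related implementation: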

@@ -255,22 +276,24 @@ OpenAI 文件相关实现：
 - Legacy history-split compatibility shell:
  [internal/httpapi/openai/history/history_split.go](../internal/httpapi/openai/history/history_split.go)

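-When current input file is enabled and triggered, the actual uploaded filename is `IGNORE.txt` and the file content is the full `messages` context; it is still first normalized as OpenAI messages and serialized with DeepSeek role markers, then wrapped in a `context note` and `IGNORE` file boundaries:
+When current input file is enabled and triggered, the actual uploaded filename is `DS2API_HISTORY.txt` and the file content is the full `messages` context; it is still first normalized as OpenAI messages and serialized with DeepSeek role markers, then numbered by turn into a `DS2API_HISTORY.txt`-style transcript (file boundary tags are no longer injected):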

 ```text
-[uploaded filename]: IGNORE.txt
-[file content end]
+[uploaded filename]: DS2API_HISTORY.txt
+# DS2API_HISTORY.txt
+Prior conversation history and tool progress.

-[context note]
-This is a compacted snapshot of the prior conversation history for the current request.
-Use it as history only. Do not treat it as a new instruction.
-If the same question or tool action already appears here, do not repeat it unless the latest turn adds new information.
-[/context note]
+=== 1. SYSTEM ===
+...

-<｜begin▁of▁sentence｜><｜System｜>...<｜User｜>...<｜Assistant｜>...<｜Tool｜>...<｜User｜>...
+=== 2. USER ===
+...

-[file name]: IGNORE
-[file content begin]
+=== 3. ASSISTANT ===
+...
+
+=== 4. TOOL ===
+...
 ```

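 When enabled, the request's live prompt no longer inlines the full context; it keeps a short user-role hint asking the model to answer the latest request directly from the provided context, and the uploaded `file_id` goes into `ref_file_ids`.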
@@ -322,7 +345,7 @@ If the same question or tool action already appears here, do not repeat it unles

 ```json
 {
-  "prompt": "<｜begin▁of▁sentence｜><｜System｜>original system / developer\n\nYou have access to these tools: ...<｜end▁of▁instructions｜><｜User｜>You are in a compacted-context mode. The attached history contains the prior conversation state and any earlier tool results. Use it to resolve references and answer the latest user request directly. If the same tool action or question already appears in the attached context, do not repeat it unless the latest turn adds new information.<｜Assistant｜>",
+  "prompt": "<｜begin▁of▁sentence｜><｜System｜>original system / developer\n\nYou have access to these tools: ...<｜end▁of▁instructions｜><｜User｜>Continue from the latest state in the attached DS2API_HISTORY.txt context. Treat it as the current working state and answer the latest user request directly.<｜Assistant｜>",
  "ref_file_ids": [
    "file-current-input-ignore",
    "file-systemprompt",
@@ -337,7 +360,7 @@ If the same question or tool action already appears here, do not repeat it unles

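 - Most structured semantics are compressed into `prompt`
 - Files stay files
-- When needed, the full context is split into a hidden context file
+- When needed, the full context is split into the `DS2API_HISTORY.txt` context file and numbered by turn into a transcript

 ## 12. Scenarios where this document must be updated alongside code changes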

@@ -350,7 +373,7 @@ If the same question or tool action already appears here, do not repeat it unles
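 - Changes to how tool results are injected
 - Changes to the tool prompt template or tool_choice constraints
 - Changes to inline file upload / file reference collection rules
-- Changes to the current input file trigger conditions, upload format, or `IGNORE` wrapper format
+- Changes to the current input file trigger conditions, upload format, or `DS2API_HISTORY.txt` transcript structure
 - Changes to how the legacy `history_split` compatibility logic is read, ignored, or degraded
 - Changes to completion payload field semantics
 - Changes to how Claude / Gemini reuse this unified semantics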
--- a/docs/toolcall-semantics.md
+++ b/docs/toolcall-semantics.md
@@ -26,7 +26,7 @@
 </tool_calls>
 ```

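-This is not a native end-to-end DSML implementation. DSML serves only as a prompt shell and a parser-entry alias; before entering the parser it is normalized to `<tool_calls>` / `<invoke>` / `<parameter>`, and internally the existing XML parsing semantics still apply.
+This is not a native end-to-end DSML implementation. DSML is mainly used to make the model deliberately emit a protocol marker, isolating it from ordinary XML semantics; before entering the parser it is normalized by fixed local tag names to `<tool_calls>` / `<invoke>` / `<parameter>`, and internally the existing XML parsing semantics still apply.

 Constraints: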

@@ -39,7 +39,8 @@
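 Compatibility fixes:

 - If the model omits the opening wrapper but still emits one or more invokes and ends with a closing wrapper, the Go parsing path restores the missing opening wrapper before parsing.
-- If the model drops the `|` separator in a DSML tag and writes a space instead (for example `<|DSML tool_calls>` / `<|DSML invoke>` / `<|DSML parameter>`, or the `<DSML tool_calls>` form without a leading pipe), or glues `DSML` directly to the tool tag name (for example `<DSMLtool_calls>` / `<DSMLinvoke>` / `<DSMLparameter>`), Go / Node normalize it within the fixed set of tool tag names; similar but non-tool tag names (such as `tool_calls_extra`) are still treated as ordinary text.
+- The Go / Node parsing layer no longer enumerates every DSML typo. It treats `DSML`, the pipe characters `|` / `｜`, whitespace, and repeated leading `<` before the tool tag name as tolerable protocol noise, and then matches only the fixed local tag names `tool_calls` / `invoke` / `parameter`. For example, `<DSML|tool_calls>`, `<<|DSML|tool_calls>`, `<|DSML tool_calls>`, `<DSMLtool_calls>`, and `<<DSML|DSML|tool_calls>` are all normalized; similar but non-fixed tag names (such as `tool_calls_extra`) are still treated as ordinary text.
+- If the model emits an extra trailing pipe after a fixed tool tag name, for example `<|DSML|tool_calls|` / `<|DSML|invoke|` / `<|DSML|parameter|`, the compatibility layer treats this trailing `|` as an abnormal tag terminator and supplies the missing `>`; if a `>` already follows, it consumes the extra `|` before normalizing.
 - This is a narrow fix for common model mistakes and does not change the recommended output format; the prompt still requires the model to emit the complete DSML shell directly.
 - A bare `<invoke ...>` / `<parameter ...>` is not treated as "supported tool syntax"; only a `tool_calls` wrapper, or a repairable missing opening wrapper, enters the tool-call path.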

@@ -53,7 +54,7 @@

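 In the streaming path (consistent across Go / Node):

-- The DSML `<|DSML|tool_calls>` wrapper, compatible variants (`<dsml|tool_calls>`, `<｜tool_calls>`, `<|tool_calls>`), the narrowly tolerated space-separated form (such as `<|DSML tool_calls>`), the glued form (such as `<DSMLtool_calls>`), and the canonical `<tool_calls>` wrapper all enter structured capture
+- The DSML `<|DSML|tool_calls>` wrapper, DSML noise-tolerant forms based on fixed local tag names, the trailing-pipe form (such as `<|DSML|tool_calls|`), and the canonical `<tool_calls>` wrapper all enter structured capture
 - If the stream starts directly with an invoke but a closing wrapper follows later, Go stream sieving also attempts recovery via the missing-opening-wrapper repair path
 - Tool calls that have already been recognized are not fed back into ordinary text
 - Blocks that do not match the new format are not executed and continue to be passed through as plain text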
@@ -61,10 +62,11 @@
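 - Nested fences (such as 4 backticks nesting 3 backticks) and fence protection inside CDATA are supported
 - If the model opens `<![CDATA[` without closing it, the streaming scan stage still conservatively keeps buffering and does not mistake example XML inside the CDATA for a real tool call; at the final parse / flush recovery stage, a narrow repair is applied to such loose CDATA to preserve, as far as possible, real tool calls that are fully wrapped at the outer level
 - When the text mentions a tag name (such as `<dsml|tool_calls>` or `<|DSML|tool_calls>` inside Markdown inline code) and a real tool call follows immediately, the sieve skips the unparseable mention candidate and keeps matching the subsequent real tool block; the mention neither causes the tool call to be lost nor truncates the body text after the mention
+- Go-side SSE reading no longer uses `bufio.Scanner` with its fixed token limit; when a single `data:` line contains very long file-write parameters, non-streaming collection, streaming parsing, and auto-continue passthrough should all keep the complete line before handing it to the tool parser

 In addition, if a `<parameter>` value is itself a valid JSON literal, it is parsed as a structured value rather than always kept as a string. For example, `123`, `true`, `null`, `[1,2]`, and `{"a":1}` are restored to the corresponding number / boolean / null / array / object.
 Structured XML parameters are also restored to JSON structures: if a parameter body contains only one or more `<item>...</item>` children, an array is emitted; item-only fields inside nested objects are likewise treated as arrays. For example, `<parameter name="questions"><item><question>...</question></item></parameter>` yields `{"questions":[{"question":"..."}]}` rather than `{"questions":{"item":...}}`.
-If the model mistakenly puts a complete structured XML fragment inside CDATA, Go / Node first protect obviously verbatim fields (such as `content` / `command` / `prompt` / `old_string` / `new_string`); for the remaining parameters they try to restore the complete XML fragment inside the CDATA into an object / array; the common `<br>` separator is normalized to a newline before parsing.
+If the model mistakenly puts a complete structured XML fragment inside CDATA, Go / Node first protect obviously verbatim fields (such as `content` / `command` / `prompt` / `old_string` / `new_string`); for the remaining parameters they try to restore the complete XML fragment inside the CDATA into an object / array; the common `<br>` separator is normalized to a newline before parsing. However, if the CDATA is just a single flat XML/HTML tag, such as the inline markup `<b>urgent</b>`, the compatibility layer keeps it as the original string instead of forcing it into an object / array; only CDATA fragments that clearly express structure, such as multiple sibling nodes, nested children, or an `item` list, trigger structured recovery.

 ## 4) Output structure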

@@ -94,7 +96,7 @@ node --test tests/node/stream-tool-sieve.test.js

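 - DSML `<|DSML|tool_calls>` wrapper parses correctly
 - Legacy canonical `<tool_calls>` wrapper parses correctly
-- Alias variants (`<dsml|tool_calls>`, `<｜tool_calls>`, `<|tool_calls>`), the DSML space-separated typo (such as `<|DSML tool_calls>`), and the glued typo (such as `<DSMLtool_calls>`) parse correctly
+- DSML noise-tolerant forms with fixed local tag names (such as `<DSML|tool_calls>`, `<<|DSML|tool_calls>`, `<|DSML tool_calls>`, `<DSMLtool_calls>`, `<<DSML|DSML|tool_calls>`) parse correctly
 - Mixed tags (DSML wrapper + canonical inner) parse correctly after normalization
 - Examples inside tilde fences `~~~` are not executed
 - Examples inside nested fences (4 backticks nesting 3 backticks) are not executed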
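The noise-tolerant matching described in this file (skip `DSML`, pipes, whitespace, and repeated leading `<`, then match only the fixed tag names) can be approximated with one regular expression. The following Go sketch is an assumption-laden illustration rather than the project's parser: `normalizeDSMLTags` and `dsmlTag` are hypothetical names, the handling of a `/` right after `<` for closing tags is an assumption, and the trailing-pipe repair is deliberately left out.

```go
package main

import (
	"fmt"
	"regexp"
)

// dsmlTag matches one or more '<', an optional '/', any mix of whitespace and
// pipe characters, any number of DSML markers, and then a fixed tag name that
// must end at a word boundary (so tool_calls_extra does not match).
var dsmlTag = regexp.MustCompile(
	`<+(/?)[\s|｜]*(?:(?i:DSML)[\s|｜]*)*(tool_calls|invoke|parameter)\b`)

// normalizeDSMLTags rewrites every matched tag opener to its canonical XML form.
func normalizeDSMLTags(s string) string {
	return dsmlTag.ReplaceAllString(s, "<${1}${2}")
}

func main() {
	for _, in := range []string{
		"<DSML|tool_calls>",
		"<<|DSML|tool_calls>",
		"<|DSML tool_calls>",
		"<DSMLtool_calls>",
		"<<DSML|DSML|tool_calls>",
		"<tool_calls_extra>",
	} {
		fmt.Printf("%s => %s\n", in, normalizeDSMLTags(in))
	}
}
```

Anchoring on the fixed tag names rather than on each typo keeps the pattern short, at the cost of also accepting noise combinations nobody has observed yet.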
--- a/go.mod
+++ b/go.mod
@@ -6,10 +6,13 @@ require (
 	github.com/andybalholm/brotli v1.2.1
 	github.com/go-chi/chi/v5 v5.2.5
 	github.com/google/uuid v1.6.0
+	github.com/hupe1980/go-tiktoken v0.0.10
 	github.com/refraction-networking/utls v1.8.2
 	github.com/router-for-me/CLIProxyAPI/v6 v6.9.14
 )

+require github.com/dlclark/regexp2 v1.11.5 // indirect
+
 require (
 	github.com/klauspost/compress v1.18.5 // indirect
 	github.com/sirupsen/logrus v1.9.4 // indirect
--- a/go.sum
+++ b/go.sum
@@ -2,10 +2,14 @@ github.com/andybalholm/brotli v1.2.1 h1:R+f5xP285VArJDRgowrfb9DqL18yVK0gKAW/F+eT
 github.com/andybalholm/brotli v1.2.1/go.mod h1:rzTDkvFWvIrjDXZHkuS16NPggd91W3kUSvPlQ1pLaKY=
 github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
 github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
+github.com/dlclark/regexp2 v1.11.5 h1:Q/sSnsKerHeCkc/jSTNq1oCm7KiVgUMZRDUoRu0JQZQ=
+github.com/dlclark/regexp2 v1.11.5/go.mod h1:DHkYz0B9wPfa6wondMfaivmHpzrQ3v9q8cnmRbL6yW8=
 github.com/go-chi/chi/v5 v5.2.5 h1:Eg4myHZBjyvJmAFjFvWgrqDTXFyOzjj7YIm3L3mu6Ug=
 github.com/go-chi/chi/v5 v5.2.5/go.mod h1:X7Gx4mteadT3eDOMTsXzmI4/rwUpOwBHLpAfupzFJP0=
 github.com/google/uuid v1.6.0 h1:NIvaJDMOsjHA8n1jAhLSgzrAzy1Hgr+hNrb57e+94F0=
 github.com/google/uuid v1.6.0/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo=
+github.com/hupe1980/go-tiktoken v0.0.10 h1:m6phOJaGyctqWdGIgwn9X8AfJvaG74tnQoDL+ntOUEQ=
+github.com/hupe1980/go-tiktoken v0.0.10/go.mod h1:NME6d8hrE+Jo+kLUZHhXShYV8e40hYkm4BbSLQKtvAo=
 github.com/klauspost/compress v1.18.5 h1:/h1gH5Ce+VWNLSWqPzOVn6XBO+vJbCNGvjoaGBFW2IE=
 github.com/klauspost/compress v1.18.5/go.mod h1:cwPg85FWrGar70rWktvGQj8/hthj3wpl0PGDogxkrSQ=
 github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
@@ -37,6 +41,8 @@ golang.org/x/net v0.52.0 h1:He/TN1l0e4mmR3QqHMT2Xab3Aj3L9qjbhRm78/6jrW0=
 golang.org/x/net v0.52.0/go.mod h1:R1MAz7uMZxVMualyPXb+VaqGSa3LIaUqk0eEt3w36Sw=
 golang.org/x/sys v0.42.0 h1:omrd2nAlyT5ESRdCLYdm3+fMfNFE/+Rf4bDIQImRJeo=
 golang.org/x/sys v0.42.0/go.mod h1:4GL1E5IUh+htKOUEOaiffhrAeqysfVGipDYzABqnCmw=
+golang.org/x/text v0.35.0 h1:JOVx6vVDFokkpaq1AEptVzLTpDe9KGpj5tR4/X+ybL8=
+golang.org/x/text v0.35.0/go.mod h1:khi/HExzZJ2pGnjenulevKNX1W67CUy0AsXcNubPGCA=
 gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405 h1:yhCVgyC4o1eVCa2tZl7eS0r+SDo693bJlVdllGtEeKM=
 gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
 gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
--- a/internal/chathistory/store.go
+++ b/internal/chathistory/store.go
@@ -14,6 +14,7 @@ import (
 	"github.com/google/uuid"

 	"ds2api/internal/config"
+	"ds2api/internal/util"
 )

 const (
@@ -309,8 +310,12 @@ func (s *Store) Update(id string, params UpdateParams) (Entry, error) {
 	if params.Status != "" {
 		item.Status = params.Status
 	}
-	item.ReasoningContent = params.ReasoningContent
-	item.Content = params.Content
+	if params.ReasoningContent != "" || item.ReasoningContent == "" {
+		item.ReasoningContent = params.ReasoningContent
+	}
+	if params.Content != "" || item.Content == "" {
+		item.Content = params.Content
+	}
 	item.Error = strings.TrimSpace(params.Error)
 	item.StatusCode = params.StatusCode
 	item.ElapsedMs = params.ElapsedMs
@@ -610,8 +615,8 @@ func buildPreview(item Entry) string {
 	if candidate == "" {
 		candidate = strings.TrimSpace(item.UserInput)
 	}
-	if len(candidate) > defaultPreviewAt {
-		return candidate[:defaultPreviewAt] + "..."
+	if truncated, ok := util.TruncateRunes(candidate, defaultPreviewAt); ok {
+		return truncated + "..."
 	}
 	return candidate
 }
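The hunk above replaces byte slicing with `util.TruncateRunes`, whose body is not shown in this part of the diff. One plausible shape for such a helper, assuming it returns the truncated string plus a flag reporting whether truncation happened, is sketched below; the real implementation in `internal/util` may differ.

```go
package main

import "fmt"

// TruncateRunes is a hypothetical stand-in for util.TruncateRunes. It keeps at
// most limit runes of s and reports whether anything was cut off. Slicing at
// rune boundaries avoids splitting multi-byte characters such as emoji.
func TruncateRunes(s string, limit int) (string, bool) {
	if limit < 0 {
		limit = 0
	}
	count := 0
	for i := range s { // i is the byte offset where each rune starts
		if count == limit {
			return s[:i], true
		}
		count++
	}
	return s, false // s has limit runes or fewer
}

func main() {
	preview, cut := TruncateRunes("😀😀😀", 2)
	fmt.Println(preview, cut)
}
```

The old `candidate[:defaultPreviewAt]` form counted bytes, so a four-byte character straddling the cut would have produced invalid UTF-8; counting runes is what the new `TestBuildPreviewPreservesUTF8MB4Characters` test checks.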
--- a/internal/chathistory/store_test.go
+++ b/internal/chathistory/store_test.go
@@ -8,6 +8,7 @@ import (
 	"strings"
 	"sync"
 	"testing"
+	"unicode/utf8"
 )

 func blockDetailDir(t *testing.T, detailDir string) func() {
@@ -105,6 +106,17 @@ func TestStoreCreatesAndPersistsEntries(t *testing.T) {
 	}
 }

+func TestBuildPreviewPreservesUTF8MB4Characters(t *testing.T) {
+	long := strings.Repeat("😀", defaultPreviewAt+1)
+	preview := buildPreview(Entry{Content: long})
+	if !utf8.ValidString(preview) {
+		t.Fatalf("expected valid utf-8 preview, got %q", preview)
+	}
+	if preview != strings.Repeat("😀", defaultPreviewAt)+"..." {
+		t.Fatalf("unexpected preview: %q", preview)
+	}
+}
+
 func TestStoreTrimsToConfiguredLimit(t *testing.T) {
 	path := filepath.Join(t.TempDir(), "chat_history.json")
 	store := New(path)
@@ -481,3 +493,112 @@ func TestStoreWritesOnlyChangedDetailFiles(t *testing.T) {
 		t.Fatalf("expected untouched detail file to remain byte-identical")
 	}
 }
+
+func TestUpdatePreservesContentWhenNewContentIsEmpty(t *testing.T) {
+	path := filepath.Join(t.TempDir(), "chat_history.json")
+	store := New(path)
+
+	started, err := store.Start(StartParams{
+		CallerID:  "caller:abc",
+		Model:     "deepseek-v4-flash",
+		Stream:    true,
+		UserInput: "hello",
+	})
+	if err != nil {
+		t.Fatalf("start entry failed: %v", err)
+	}
+
+	if _, err := store.Update(started.ID, UpdateParams{
+		Status:           "streaming",
+		ReasoningContent: "let me think",
+		Content:          "I'll help you with that.",
+	}); err != nil {
+		t.Fatalf("progress update failed: %v", err)
+	}
+
+	updated, err := store.Update(started.ID, UpdateParams{
+		Status:    "success",
+		Content:   "",
+		Completed: true,
+	})
+	if err != nil {
+		t.Fatalf("success update failed: %v", err)
+	}
+
+	if updated.Content != "I'll help you with that." {
+		t.Fatalf("expected content to be preserved, got %q", updated.Content)
+	}
+	if updated.ReasoningContent != "let me think" {
+		t.Fatalf("expected reasoning content to be preserved, got %q", updated.ReasoningContent)
+	}
+
+	full, err := store.Get(started.ID)
+	if err != nil {
+		t.Fatalf("get entry failed: %v", err)
+	}
+	if full.Content != "I'll help you with that." {
+		t.Fatalf("expected persisted content to be preserved, got %q", full.Content)
+	}
+	if full.ReasoningContent != "let me think" {
+		t.Fatalf("expected persisted reasoning content to be preserved, got %q", full.ReasoningContent)
+	}
+}
+
+func TestUpdateAllowsSettingContentFromEmpty(t *testing.T) {
+	path := filepath.Join(t.TempDir(), "chat_history.json")
+	store := New(path)
+
+	started, err := store.Start(StartParams{
+		CallerID:  "caller:abc",
+		Model:     "deepseek-v4-flash",
+		Stream:    true,
+		UserInput: "hello",
+	})
+	if err != nil {
+		t.Fatalf("start entry failed: %v", err)
+	}
+
+	updated, err := store.Update(started.ID, UpdateParams{
+		Status:  "success",
+		Content: "final answer",
+	})
+	if err != nil {
+		t.Fatalf("update failed: %v", err)
+	}
+	if updated.Content != "final answer" {
+		t.Fatalf("expected content to be set, got %q", updated.Content)
+	}
+}
+
+func TestUpdateAllowsOverwritingContentWithNewValue(t *testing.T) {
+	path := filepath.Join(t.TempDir(), "chat_history.json")
+	store := New(path)
+
+	started, err := store.Start(StartParams{
+		CallerID:  "caller:abc",
+		Model:     "deepseek-v4-flash",
+		Stream:    true,
+		UserInput: "hello",
+	})
+	if err != nil {
+		t.Fatalf("start entry failed: %v", err)
+	}
+
+	if _, err := store.Update(started.ID, UpdateParams{
+		Status:  "streaming",
+		Content: "partial",
+	}); err != nil {
+		t.Fatalf("first update failed: %v", err)
+	}
+
+	updated, err := store.Update(started.ID, UpdateParams{
+		Status:  "success",
+		Content: "final answer",
+	})
+	if err != nil {
+		t.Fatalf("second update failed: %v", err)
+	}
+	if updated.Content != "final answer" {
+		t.Fatalf("expected content to be overwritten, got %q", updated.Content)
+	}
+}
--- a/internal/config/config_edge_test.go
+++ b/internal/config/config_edge_test.go
@@ -79,13 +79,20 @@ func TestGetModelConfigDeepSeekExpertReasonerSearch(t *testing.T) {
 	}
 }

-func TestGetModelConfigDeepSeekVisionReasonerSearch(t *testing.T) {
-	thinking, search, ok := GetModelConfig("deepseek-v4-vision-search")
+func TestGetModelConfigDeepSeekVision(t *testing.T) {
+	thinking, search, ok := GetModelConfig("deepseek-v4-vision")
 	if !ok {
-		t.Fatal("expected ok for deepseek-v4-vision-search")
+		t.Fatal("expected ok for deepseek-v4-vision")
 	}
-	if !thinking || !search {
-		t.Fatalf("expected both true, got thinking=%v search=%v", thinking, search)
+	if !thinking || search {
+		t.Fatalf("expected thinking=true search=false, got thinking=%v search=%v", thinking, search)
+	}
+}
+
+func TestGetModelConfigDeepSeekVisionSearchUnsupported(t *testing.T) {
+	_, _, ok := GetModelConfig("deepseek-v4-vision-search")
+	if ok {
+		t.Fatal("expected deepseek-v4-vision-search to be unsupported")
 	}
 }

@@ -748,18 +755,16 @@ func TestOpenAIModelsResponse(t *testing.T) {
 		t.Fatal("expected non-empty models list")
 	}
 	expected := map[string]bool{
-		"deepseek-v4-flash":                    false,
-		"deepseek-v4-flash-nothinking":         false,
-		"deepseek-v4-pro":                      false,
-		"deepseek-v4-pro-nothinking":           false,
-		"deepseek-v4-flash-search":             false,
-		"deepseek-v4-flash-search-nothinking":  false,
-		"deepseek-v4-pro-search":               false,
-		"deepseek-v4-pro-search-nothinking":    false,
-		"deepseek-v4-vision":                   false,
-		"deepseek-v4-vision-nothinking":        false,
-		"deepseek-v4-vision-search":            false,
-		"deepseek-v4-vision-search-nothinking": false,
+		"deepseek-v4-flash":                   false,
+		"deepseek-v4-flash-nothinking":        false,
+		"deepseek-v4-pro":                     false,
+		"deepseek-v4-pro-nothinking":          false,
+		"deepseek-v4-flash-search":            false,
+		"deepseek-v4-flash-search-nothinking": false,
+		"deepseek-v4-pro-search":              false,
+		"deepseek-v4-pro-search-nothinking":   false,
+		"deepseek-v4-vision":                  false,
+		"deepseek-v4-vision-nothinking":       false,
 	}
 	for _, model := range data {
 		if _, ok := expected[model.ID]; ok {
--- a/internal/config/model_alias_test.go
+++ b/internal/config/model_alias_test.go
@@ -144,10 +144,17 @@ func TestResolveModelCustomAliasToExpert(t *testing.T) {

 func TestResolveModelCustomAliasToVision(t *testing.T) {
 	got, ok := ResolveModel(mockModelAliasReader{
-		"my-vision-model": "deepseek-v4-vision-search",
+		"my-vision-model": "deepseek-v4-vision",
 	}, "my-vision-model")
-	if !ok || got != "deepseek-v4-vision-search" {
-		t.Fatalf("expected alias -> deepseek-v4-vision-search, got ok=%v model=%q", ok, got)
+	if !ok || got != "deepseek-v4-vision" {
+		t.Fatalf("expected alias -> deepseek-v4-vision, got ok=%v model=%q", ok, got)
+	}
+}
+
+func TestResolveModelHeuristicVisionIgnoresSearchSuffix(t *testing.T) {
+	got, ok := ResolveModel(nil, "gemini-vision-search")
+	if !ok || got != "deepseek-v4-vision" {
+		t.Fatalf("expected heuristic vision alias to resolve without search variant, got ok=%v model=%q", ok, got)
 	}
 }

--- a/internal/config/models.go
+++ b/internal/config/models.go
@@ -22,7 +22,6 @@ var deepSeekBaseModels = []ModelInfo{
 	{ID: "deepseek-v4-flash-search", Object: "model", Created: 1677610602, OwnedBy: "deepseek", Permission: []any{}},
 	{ID: "deepseek-v4-pro-search", Object: "model", Created: 1677610602, OwnedBy: "deepseek", Permission: []any{}},
 	{ID: "deepseek-v4-vision", Object: "model", Created: 1677610602, OwnedBy: "deepseek", Permission: []any{}},
-	{ID: "deepseek-v4-vision-search", Object: "model", Created: 1677610602, OwnedBy: "deepseek", Permission: []any{}},
 }

 var DeepSeekModels = appendNoThinkingVariants(deepSeekBaseModels)
@@ -67,7 +66,7 @@ func GetModelConfig(model string) (thinking bool, search bool, ok bool) {
 	switch baseModel {
 	case "deepseek-v4-flash", "deepseek-v4-pro", "deepseek-v4-vision":
 		return !noThinking, false, true
-	case "deepseek-v4-flash-search", "deepseek-v4-pro-search", "deepseek-v4-vision-search":
+	case "deepseek-v4-flash-search", "deepseek-v4-pro-search":
 		return !noThinking, true, true
 	default:
 		return false, false, false
@@ -81,7 +80,7 @@ func GetModelType(model string) (modelType string, ok bool) {
 		return "default", true
 	case "deepseek-v4-pro", "deepseek-v4-pro-search":
 		return "expert", true
-	case "deepseek-v4-vision", "deepseek-v4-vision-search":
+	case "deepseek-v4-vision":
 		return "vision", true
 	default:
 		return "", false
@@ -359,8 +358,6 @@ func resolveCanonicalModel(aliases map[string]string, model string) (string, boo
 	useSearch := strings.Contains(model, "search")

 	switch {
-	case useVision && useSearch:
-		return "deepseek-v4-vision-search", true
 	case useVision:
 		return "deepseek-v4-vision", true
 	case useReasoner && useSearch:
--- a/internal/config/paths.go
+++ b/internal/config/paths.go
@@ -30,9 +30,29 @@ func ResolvePath(envKey, defaultRel string) string {
 }

 func ConfigPath() string {
+	if strings.TrimSpace(os.Getenv("DS2API_CONFIG_PATH")) == "" && BaseDir() == "/app" {
+		return containerDefaultConfigPath()
+	}
 	return ResolvePath("DS2API_CONFIG_PATH", "config.json")
 }

+func containerDefaultConfigPath() string {
+	// Container images run as non-root by default, so use /data only when it already exists (mounted or provisioned).
+	// Otherwise keep /app/config.json so that an admin-side save does not fail on MkdirAll("/data").
+	if st, err := os.Stat("/data"); err == nil && st.IsDir() {
+		return "/data/config.json"
+	}
+	return "/app/config.json"
+}
+
+func legacyContainerConfigPath() string {
+	return "/app/config.json"
+}
+
+func shouldTryLegacyContainerConfigPath() bool {
+	return strings.TrimSpace(os.Getenv("DS2API_CONFIG_PATH")) == "" && BaseDir() == "/app"
+}
+
 func RawStreamSampleRoot() string {
 	return ResolvePath("DS2API_RAW_STREAM_SAMPLE_ROOT", "tests/raw_stream_samples")
 }
--- a/internal/config/paths_test.go
+++ b/internal/config/paths_test.go
@@ -0,0 +1,28 @@
+package config
+
+import (
+	"os"
+	"testing"
+)
+
+func TestContainerDefaultConfigPath(t *testing.T) {
+	t.Run("fallback to /app when /data is missing", func(t *testing.T) {
+		// The test environment does not guarantee a writable or mounted /data.
+		// If /data is absent, we must keep the /app fallback to avoid persistence failures.
+		if _, err := os.Stat("/data"); err == nil {
+			t.Skip("/data exists in this environment; cannot validate missing-/data fallback")
+		}
+		if got := containerDefaultConfigPath(); got != "/app/config.json" {
+			t.Fatalf("containerDefaultConfigPath() = %q, want %q", got, "/app/config.json")
+		}
+	})
+
+	t.Run("prefer /data when /data directory exists", func(t *testing.T) {
+		if _, err := os.Stat("/data"); err != nil {
+			t.Skip("/data does not exist in this environment")
+		}
+		if got := containerDefaultConfigPath(); got != "/data/config.json" {
+			t.Fatalf("containerDefaultConfigPath() = %q, want %q", got, "/data/config.json")
+		}
+	})
+}
--- a/internal/config/store.go
+++ b/internal/config/store.go
@@ -87,12 +87,17 @@ func loadConfig() (Config, bool, error) {
 		}
 		return cfg, true, err
 	}
-
 	cfg, err := loadConfigFromFile(ConfigPath())
 	if err != nil {
+		if shouldTryLegacyContainerConfigPath() {
+			legacyPath := legacyContainerConfigPath()
+			if legacyCfg, legacyErr := loadConfigFromFile(legacyPath); legacyErr == nil {
+				Logger.Info("[config] loaded legacy container config path", "path", legacyPath)
+				return legacyCfg, false, nil
+			}
+		}
 		if IsVercel() {
-			// Vercel one-click deploy may start without a writable/present config file.
-			// Keep an in-memory config so users can bootstrap via WebUI then sync env.
+			// Vercel may start without a writable or present config file; keep an in-memory config so users can bootstrap via WebUI.
 			return Config{}, true, nil
 		}
 		return Config{}, false, err
--- a/internal/deepseek/client/client_continue.go
+++ b/internal/deepseek/client/client_continue.go
@@ -7,6 +7,7 @@ import (
 	dsprotocol "ds2api/internal/deepseek/protocol"
 	"encoding/json"
 	"errors"
+	"fmt"
 	"io"
 	"net/http"
 	"strings"
@@ -27,7 +28,7 @@ type continueState struct {
 }

 // wrapCompletionWithAutoContinue wraps the completion response body so that
-// if the upstream indicates the response is incomplete (WIP / INCOMPLETE /
+// if the upstream indicates the response is incomplete (INCOMPLETE /
 // AUTO_CONTINUE), ds2api will automatically call the DeepSeek continue
 // endpoint and splice the continuation SSE stream onto the original.
 // The caller sees a single, seamless SSE stream.
@@ -132,33 +133,51 @@ func pumpAutoContinue(ctx context.Context, pw *io.PipeWriter, initial io.ReadClo
 // sentinels are consumed (not forwarded) so that the downstream only sees
 // one final [DONE] at the very end.
 func streamBodyWithContinueState(ctx context.Context, pw *io.PipeWriter, body io.Reader, state *continueState) (bool, error) {
-	scanner := bufio.NewScanner(body)
-	scanner.Buffer(make([]byte, 0, 64*1024), 2*1024*1024)
+	reader := bufio.NewReaderSize(body, 64*1024)
 	hadDone := false
-	for scanner.Scan() {
+	for {
 		select {
 		case <-ctx.Done():
 			return hadDone, ctx.Err()
 		default:
 		}
-		line := append([]byte{}, scanner.Bytes()...)
-		trimmed := strings.TrimSpace(string(line))
-		if trimmed == "" {
-			continue
-		}
-		if strings.HasPrefix(trimmed, "data:") {
-			data := strings.TrimSpace(strings.TrimPrefix(trimmed, "data:"))
-			if data == "[DONE]" {
-				hadDone = true
-				continue
+		line, err := reader.ReadBytes('\n')
+		if len(line) == 0 && err != nil {
+			if err == io.EOF {
+				return hadDone, nil
 			}
-			state.observe(data)
+			return hadDone, err
 		}
-		if _, err := io.Copy(pw, bytes.NewReader(append(line, '\n'))); err != nil {
+		trimmed := strings.TrimSpace(string(line))
+		if trimmed != "" {
+			if strings.HasPrefix(trimmed, "data:") {
+				data := strings.TrimSpace(strings.TrimPrefix(trimmed, "data:"))
+				if data == "[DONE]" {
+					hadDone = true
+					if err != nil && err != io.EOF {
+						return hadDone, err
+					}
+					if err == io.EOF {
+						return hadDone, nil
+					}
+					continue
+				}
+				state.observe(data)
+			}
+			if !strings.HasSuffix(string(line), "\n") {
+				line = append(line, '\n')
+			}
+			if _, copyErr := io.Copy(pw, bytes.NewReader(line)); copyErr != nil {
+				return hadDone, copyErr
+			}
+		}
+		if err != nil {
+			if err == io.EOF {
+				return hadDone, nil
+			}
 			return hadDone, err
 		}
 	}
-	return hadDone, scanner.Err()
 }

 // observe extracts continue-relevant signals from an SSE JSON chunk.
@@ -174,49 +193,100 @@ func (s *continueState) observe(data string) {
 	if id := intFrom(chunk["response_message_id"]); id > 0 {
 		s.responseMessageID = id
 	}
-	// Path-based status: {"p": "response/status", "v": "FINISHED"}
-	if p, _ := chunk["p"].(string); p == "response/status" {
-		if status, _ := chunk["v"].(string); status != "" {
-			s.lastStatus = strings.TrimSpace(status)
-			if strings.EqualFold(s.lastStatus, "FINISHED") {
-				s.finished = true
-			}
-		}
+	s.observeDirectPatch(asString(chunk["p"]), chunk["v"])
+	if p, _ := chunk["p"].(string); p == "response" {
+		s.observeBatchPatches("response", chunk["v"])
+	} else {
+		s.observeBatchPatches("", chunk["v"])
 	}
-	// Nested v.response
-	v, _ := chunk["v"].(map[string]any)
-	if response, _ := v["response"].(map[string]any); response != nil {
-		if id := intFrom(response["message_id"]); id > 0 {
-			s.responseMessageID = id
-		}
-		if status, _ := response["status"].(string); status != "" {
-			s.lastStatus = strings.TrimSpace(status)
-			if strings.EqualFold(s.lastStatus, "FINISHED") {
-				s.finished = true
-			}
-		}
-		if autoContinue, ok := response["auto_continue"].(bool); ok && autoContinue {
+	if v, _ := chunk["v"].(map[string]any); v != nil {
+		s.observeResponseObject(v["response"])
+	}
+	if message, _ := chunk["message"].(map[string]any); message != nil {
+		s.observeResponseObject(message["response"])
+	}
+}
+
+func (s *continueState) observeDirectPatch(path string, value any) {
+	if s == nil {
+		return
+	}
+	switch strings.Trim(strings.TrimSpace(path), "/") {
+	case "response/status", "status", "response/quasi_status", "quasi_status":
+		s.setStatus(asString(value))
+	case "response/auto_continue", "auto_continue":
+		if v, ok := value.(bool); ok && v {
 			s.lastStatus = "AUTO_CONTINUE"
 		}
 	}
-	// Nested message.response
-	if message, _ := chunk["message"].(map[string]any); message != nil {
-		if response, _ := message["response"].(map[string]any); response != nil {
-			if id := intFrom(response["message_id"]); id > 0 {
-				s.responseMessageID = id
-			}
-			if status, _ := response["status"].(string); status != "" {
-				s.lastStatus = strings.TrimSpace(status)
-				if strings.EqualFold(s.lastStatus, "FINISHED") {
-					s.finished = true
-				}
+}
+
+func (s *continueState) observeResponseObject(raw any) {
+	if s == nil {
+		return
+	}
+	response, _ := raw.(map[string]any)
+	if response == nil {
+		return
+	}
+	if id := intFrom(response["message_id"]); id > 0 {
+		s.responseMessageID = id
+	}
+	s.setStatus(asString(response["status"]))
+	if autoContinue, ok := response["auto_continue"].(bool); ok && autoContinue {
+		s.lastStatus = "AUTO_CONTINUE"
+	}
+}
+
+func (s *continueState) observeBatchPatches(parentPath string, raw any) {
+	if s == nil {
+		return
+	}
+	patches, ok := raw.([]any)
+	if !ok {
+		return
+	}
+	for _, patch := range patches {
+		m, ok := patch.(map[string]any)
+		if !ok {
+			continue
+		}
+		path := strings.TrimSpace(asString(m["p"]))
+		if path == "" {
+			continue
+		}
+		fullPath := path
+		if parent := strings.Trim(strings.TrimSpace(parentPath), "/"); parent != "" && !strings.Contains(path, "/") {
+			fullPath = parent + "/" + path
+		}
+		switch strings.Trim(strings.TrimSpace(fullPath), "/") {
+		case "response/status", "status", "response/quasi_status", "quasi_status":
+			s.setStatus(asString(m["v"]))
+		case "response/auto_continue", "auto_continue":
+			if v, ok := m["v"].(bool); ok && v {
+				s.lastStatus = "AUTO_CONTINUE"
 			}
 		}
 	}
 }

-// shouldContinue returns true when the upstream indicates the response is
-// not yet finished and we have enough information to issue a continue request.
+func (s *continueState) setStatus(status string) {
+	if s == nil {
+		return
+	}
+	normalized := strings.TrimSpace(status)
+	if normalized == "" {
+		return
+	}
+	s.lastStatus = normalized
+	if strings.EqualFold(normalized, "FINISHED") || strings.EqualFold(normalized, "CONTENT_FILTER") {
+		s.finished = true
+	}
+}
+
+// shouldContinue returns true when the upstream explicitly indicates the
+// response is incomplete and we have enough information to issue a continue
+// request. Plain WIP is not sufficient because normal streams begin in WIP.
 func (s *continueState) shouldContinue() bool {
 	if s == nil {
 		return false
@@ -225,7 +295,7 @@ func (s *continueState) shouldContinue() bool {
 		return false
 	}
 	switch strings.ToUpper(strings.TrimSpace(s.lastStatus)) {
-	case "WIP", "INCOMPLETE", "AUTO_CONTINUE":
+	case "INCOMPLETE", "AUTO_CONTINUE":
 		return true
 	default:
 		return false
@@ -241,3 +311,19 @@ func (s *continueState) prepareForNextRound() {
 	s.finished = false
 	s.lastStatus = ""
 }
+
+func asString(v any) string {
+	if v == nil {
+		return ""
+	}
+	switch x := v.(type) {
+	case string:
+		return x
+	default:
+		s := strings.TrimSpace(strings.ReplaceAll(strings.TrimSpace(fmt.Sprint(v)), "\u0000", ""))
+		if s == "<nil>" {
+			return ""
+		}
+		return s
+	}
+}
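The batch-patch handling above resolves a relative child path such as `quasi_status` against its parent (`response`) and then decides whether the resulting status is an explicit continuation signal. A minimal standalone sketch of those two steps follows; `resolvePatchPath` and `needsContinue` are hypothetical names for illustration, and the real `shouldContinue` applies further checks that are not reproduced here.

```go
package main

import (
	"fmt"
	"strings"
)

// resolvePatchPath joins a slash-free child path onto its parent path and
// strips surrounding slashes, mirroring the fullPath logic in the diff.
func resolvePatchPath(parent, path string) string {
	path = strings.TrimSpace(path)
	parent = strings.Trim(strings.TrimSpace(parent), "/")
	if parent != "" && !strings.Contains(path, "/") {
		path = parent + "/" + path
	}
	return strings.Trim(path, "/")
}

// needsContinue reports whether a status is an explicit continuation signal.
// Plain WIP is excluded because normal streams begin in WIP.
func needsContinue(status string) bool {
	switch strings.ToUpper(strings.TrimSpace(status)) {
	case "INCOMPLETE", "AUTO_CONTINUE":
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(resolvePatchPath("response", "quasi_status"))
	fmt.Println(resolvePatchPath("", "response/status"))
	fmt.Println(needsContinue("WIP"), needsContinue("incomplete"))
}
```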
--- a/internal/deepseek/client/client_continue_test.go
+++ b/internal/deepseek/client/client_continue_test.go
@@ -8,6 +8,7 @@ import (
 	"io"
 	"net/http"
 	"strings"
+	"sync/atomic"
 	"testing"

 	"ds2api/internal/auth"
@@ -124,6 +125,146 @@ func TestCallCompletionAutoContinueThreadsPowHeader(t *testing.T) {
 	}
 }

+func TestAutoContinueDoesNotTriggerOnPlainWIPWithoutExplicitContinuationSignal(t *testing.T) {
+	initialBody := strings.Join([]string{
+		`data: {"response_message_id":321,"v":{"response":{"message_id":321,"status":"WIP","auto_continue":false}}}`,
+		`data: [DONE]`,
+	}, "\n") + "\n"
+
+	var continueCalls atomic.Int32
+	body := newAutoContinueBody(context.Background(), io.NopCloser(strings.NewReader(initialBody)), "session-123", 8, func(context.Context, string, int) (*http.Response, error) {
+		continueCalls.Add(1)
+		return nil, errors.New("continue should not have been called")
+	})
+	defer func() { _ = body.Close() }()
+
+	out, err := io.ReadAll(body)
+	if err != nil {
+		t.Fatalf("read body failed: %v", err)
+	}
+	if continueCalls.Load() != 0 {
+		t.Fatalf("expected no continue calls, got %d", continueCalls.Load())
+	}
+	if !bytes.Contains(out, []byte(`"status":"WIP"`)) || !bytes.Contains(out, []byte(`data: [DONE]`)) {
+		t.Fatalf("expected original body to pass through unchanged, got=%s", string(out))
+	}
+}
+
+func TestAutoContinuePassesThroughLongSingleSSELine(t *testing.T) {
+	payload := strings.Repeat("x", 2*1024*1024+4096)
+	initialBody := `data: {"p":"response/content","v":"` + payload + `"}` + "\n" +
+		`data: [DONE]` + "\n"
+
+	body := newAutoContinueBody(context.Background(), io.NopCloser(strings.NewReader(initialBody)), "session-123", 8, func(context.Context, string, int) (*http.Response, error) {
+		return nil, errors.New("continue should not have been called")
+	})
+	defer func() { _ = body.Close() }()
+
+	out, err := io.ReadAll(body)
+	if err != nil {
+		t.Fatalf("read body failed: %v", err)
+	}
+	if !bytes.Contains(out, []byte(payload)) {
+		t.Fatalf("expected long SSE payload to pass through, got len=%d want payload len=%d", len(out), len(payload))
+	}
+	if !bytes.Contains(out, []byte(`data: [DONE]`)) {
+		t.Fatalf("expected final DONE sentinel in body, got len=%d", len(out))
+	}
+}
+
+func TestAutoContinueTriggersOnDirectQuasiStatusIncomplete(t *testing.T) {
+	initialBody := strings.Join([]string{
+		`data: {"response_message_id":321,"p":"response/content","v":"<tool_calls><invoke name=\"write_file\"><parameter name=\"content\"><![CDATA[part-one"}`,
+		`data: {"p":"response/quasi_status","v":"INCOMPLETE"}`,
+		`data: [DONE]`,
+	}, "\n") + "\n"
+
+	var continueCalls atomic.Int32
+	body := newAutoContinueBody(context.Background(), io.NopCloser(strings.NewReader(initialBody)), "session-123", 8, func(context.Context, string, int) (*http.Response, error) {
+		continueCalls.Add(1)
+		return &http.Response{
+			StatusCode: http.StatusOK,
+			Header:     make(http.Header),
+			Body: io.NopCloser(strings.NewReader(
+				`data: {"response_message_id":322,"p":"response/content","v":"-part-two]]></parameter></invoke></tool_calls>"}` + "\n" +
+					`data: {"p":"response/status","v":"FINISHED"}` + "\n" +
+					`data: [DONE]` + "\n",
+			)),
+		}, nil
+	})
+	defer func() { _ = body.Close() }()
+
+	out, err := io.ReadAll(body)
+	if err != nil {
+		t.Fatalf("read body failed: %v", err)
+	}
+	if continueCalls.Load() != 1 {
+		t.Fatalf("expected exactly one continue call, got %d", continueCalls.Load())
+	}
+	if !bytes.Contains(out, []byte("part-one")) || !bytes.Contains(out, []byte("-part-two")) {
+		t.Fatalf("expected continued tool content in body, got=%s", string(out))
+	}
+}
+
+func TestAutoContinueTriggersOnResponseBatchQuasiStatusIncomplete(t *testing.T) {
+	initialBody := strings.Join([]string{
+		`data: {"response_message_id":321,"v":{"response":{"message_id":321,"status":"WIP","auto_continue":false}}}`,
+		`data: {"p":"response","o":"BATCH","v":[{"p":"accumulated_token_usage","v":2413},{"p":"quasi_status","v":"INCOMPLETE"}]}`,
+		`data: [DONE]`,
+	}, "\n") + "\n"
+
+	var continueCalls atomic.Int32
+	body := newAutoContinueBody(context.Background(), io.NopCloser(strings.NewReader(initialBody)), "session-123", 8, func(context.Context, string, int) (*http.Response, error) {
+		continueCalls.Add(1)
+		return &http.Response{
+			StatusCode: http.StatusOK,
+			Header:     make(http.Header),
+			Body: io.NopCloser(strings.NewReader(
+				`data: {"response_message_id":322,"p":"response/status","v":"FINISHED"}` + "\n" +
+					`data: [DONE]` + "\n",
+			)),
+		}, nil
+	})
+	defer func() { _ = body.Close() }()
+
+	out, err := io.ReadAll(body)
+	if err != nil {
+		t.Fatalf("read body failed: %v", err)
+	}
+	if continueCalls.Load() != 1 {
+		t.Fatalf("expected exactly one continue call, got %d", continueCalls.Load())
+	}
+	if !bytes.Contains(out, []byte(`"quasi_status","v":"INCOMPLETE"`)) || !bytes.Contains(out, []byte(`"v":"FINISHED"`)) {
+		t.Fatalf("expected continued output to include initial and final rounds, got=%s", string(out))
+	}
+}
+
+func TestAutoContinueDoesNotTriggerWhenResponseBatchQuasiStatusFinished(t *testing.T) {
+	initialBody := strings.Join([]string{
+		`data: {"response_message_id":321,"v":{"response":{"message_id":321,"status":"WIP","auto_continue":false}}}`,
+		`data: {"p":"response","o":"BATCH","v":[{"p":"accumulated_token_usage","v":2413},{"p":"quasi_status","v":"FINISHED"}]}`,
+		`data: [DONE]`,
+	}, "\n") + "\n"
+
+	var continueCalls atomic.Int32
+	body := newAutoContinueBody(context.Background(), io.NopCloser(strings.NewReader(initialBody)), "session-123", 8, func(context.Context, string, int) (*http.Response, error) {
+		continueCalls.Add(1)
+		return nil, errors.New("continue should not have been called")
+	})
+	defer func() { _ = body.Close() }()
+
+	out, err := io.ReadAll(body)
+	if err != nil {
+		t.Fatalf("read body failed: %v", err)
+	}
+	if continueCalls.Load() != 0 {
+		t.Fatalf("expected no continue calls, got %d", continueCalls.Load())
+	}
+	if !bytes.Contains(out, []byte(`"quasi_status","v":"FINISHED"`)) || !bytes.Contains(out, []byte(`data: [DONE]`)) {
+		t.Fatalf("expected original finished body to pass through unchanged, got=%s", string(out))
+	}
+}
+
 type failingOrCompletionDoer struct {
 	completionResp *http.Response
 }
@@ -134,3 +275,33 @@ func (d failingOrCompletionDoer) Do(req *http.Request) (*http.Response, error) {
 	}
 	return nil, errors.New("forced stream failure")
 }
+
+func TestAutoContinuePreservesIncompleteStateWhenNextChunkOmitsStatus(t *testing.T) {
+	initialBody := strings.Join([]string{
+		`data: {"response_message_id":321,"v":{"response":{"message_id":321,"status":"INCOMPLETE"}}}`,
+		`data: {"p":"response/content","v":{"text":"continued"}}`,
+		`data: [DONE]`,
+	}, "\n") + "\n"
+
+	var continueCalls atomic.Int32
+	body := newAutoContinueBody(context.Background(), io.NopCloser(strings.NewReader(initialBody)), "session-123", 8, func(context.Context, string, int) (*http.Response, error) {
+		continueCalls.Add(1)
+		return &http.Response{
+			StatusCode: http.StatusOK,
+			Header:     make(http.Header),
+			Body: io.NopCloser(strings.NewReader(
+				`data: {"response_message_id":322,"p":"response/status","v":"FINISHED"}` + "\n" +
+					`data: [DONE]` + "\n",
+			)),
+		}, nil
+	})
+	defer func() { _ = body.Close() }()
+
+	_, err := io.ReadAll(body)
+	if err != nil {
+		t.Fatalf("read body failed: %v", err)
+	}
+	if continueCalls.Load() != 1 {
+		t.Fatalf("expected exactly one continue call, got %d", continueCalls.Load())
+	}
+}
--- a/internal/deepseek/client/client_upload.go
+++ b/internal/deepseek/client/client_upload.go
@@ -23,6 +23,7 @@ type UploadFileRequest struct {
 	Filename    string
 	ContentType string
 	Purpose     string
+	ModelType   string
 	Data        []byte
 }

@@ -54,6 +55,7 @@ func (c *Client) UploadFile(ctx context.Context, a *auth.RequestAuth, req Upload
 		contentType = "application/octet-stream"
 	}
 	purpose := strings.TrimSpace(req.Purpose)
+	modelType := strings.ToLower(strings.TrimSpace(req.ModelType))
 	body, contentTypeHeader, err := buildUploadMultipartBody(filename, contentType, req.Data)
 	if err != nil {
 		return nil, err
@@ -64,6 +66,9 @@ func (c *Client) UploadFile(ctx context.Context, a *auth.RequestAuth, req Upload
 		"purpose":      purpose,
 		"bytes":        len(req.Data),
 	}
+	if modelType != "" {
+		capturePayload["model_type"] = modelType
+	}
 	captureSession := c.capture.Start("deepseek_upload_file", dsprotocol.DeepSeekUploadFileURL, a.AccountID, capturePayload)
 	attempts := 0
 	refreshed := false
@@ -81,6 +86,9 @@ func (c *Client) UploadFile(ctx context.Context, a *auth.RequestAuth, req Upload
 		}
 		headers := c.authHeaders(a.DeepSeekToken)
 		headers["Content-Type"] = contentTypeHeader
+		if modelType != "" {
+			headers["x-model-type"] = modelType
+		}
 		headers["x-ds-pow-response"] = powHeader
 		headers["x-file-size"] = strconv.Itoa(len(req.Data))
 		headers["x-thinking-enabled"] = "1"
--- a/internal/deepseek/client/client_upload_test.go
+++ b/internal/deepseek/client/client_upload_test.go
@@ -82,6 +82,7 @@ func TestUploadFileUsesUploadTargetPowAndMultipartHeaders(t *testing.T) {
 	var seenTargetPath string
 	var seenContentType string
 	var seenFileSize string
+	var seenModelType string
 	var seenBody string
 	call := 0
 	client := &Client{
@@ -96,6 +97,7 @@ func TestUploadFileUsesUploadTargetPowAndMultipartHeaders(t *testing.T) {
 				seenPow = req.Header.Get("x-ds-pow-response")
 				seenContentType = req.Header.Get("Content-Type")
 				seenFileSize = req.Header.Get("x-file-size")
+				seenModelType = req.Header.Get("x-model-type")
 				seenBody = string(bodyBytes)
 				return &http.Response{StatusCode: http.StatusOK, Header: make(http.Header), Body: io.NopCloser(strings.NewReader(uploadResponse)), Request: req}, nil
 			default:
@@ -112,6 +114,7 @@ func TestUploadFileUsesUploadTargetPowAndMultipartHeaders(t *testing.T) {
 		Filename:    "demo.txt",
 		ContentType: "text/plain",
 		Purpose:     "assistants",
+		ModelType:   "vision",
 		Data:        []byte("hello"),
 	}, 1)
 	if err != nil {
@@ -140,6 +143,9 @@ func TestUploadFileUsesUploadTargetPowAndMultipartHeaders(t *testing.T) {
 	if seenFileSize != "5" {
 		t.Fatalf("expected x-file-size=5, got %q", seenFileSize)
 	}
+	if seenModelType != "vision" {
+		t.Fatalf("expected x-model-type=vision, got %q", seenModelType)
+	}
 	if !strings.HasPrefix(seenContentType, "multipart/form-data; boundary=") {
 		t.Fatalf("expected multipart content type, got %q", seenContentType)
 	}
--- a/internal/deepseek/protocol/constants.go
+++ b/internal/deepseek/protocol/constants.go
@@ -159,6 +159,6 @@ func toStringSet(in []string) map[string]struct{} {

 const (
 	KeepAliveTimeout  = 5
-	StreamIdleTimeout = 90
-	MaxKeepaliveCount = 10
+	StreamIdleTimeout = 300
+	MaxKeepaliveCount = 40
 )
--- a/internal/deepseek/protocol/constants_shared.json
+++ b/internal/deepseek/protocol/constants_shared.json
@@ -2,7 +2,7 @@
  "client": {
    "name": "DeepSeek",
    "platform": "android",
-    "version": "2.0.1",
+    "version": "2.0.4",
    "android_api_level": "35",
    "locale": "zh_CN"
  },
@@ -24,4 +24,4 @@
  "skip_exact_paths": [
    "response/search_status"
  ]
-}
+}
--- a/internal/deepseek/protocol/sse.go
+++ b/internal/deepseek/protocol/sse.go
@@ -2,20 +2,24 @@ package protocol

 import (
 	"bufio"
+	"io"
 	"net/http"
 )

 func ScanSSELines(resp *http.Response, onLine func([]byte) bool) error {
-	scanner := bufio.NewScanner(resp.Body)
-	buf := make([]byte, 0, 64*1024)
-	scanner.Buffer(buf, 2*1024*1024)
-	for scanner.Scan() {
-		if !onLine(scanner.Bytes()) {
-			break
+	reader := bufio.NewReaderSize(resp.Body, 64*1024)
+	for {
+		line, err := reader.ReadBytes('\n')
+		if len(line) > 0 {
+			if !onLine(line) {
+				return nil
+			}
+		}
+		if err != nil {
+			if err == io.EOF {
+				return nil
+			}
+			return err
 		}
 	}
-	if err := scanner.Err(); err != nil {
-		return err
-	}
-	return nil
 }
--- a/internal/deepseek/protocol/sse_test.go
+++ b/internal/deepseek/protocol/sse_test.go
@@ -0,0 +1,26 @@
+package protocol
+
+import (
+	"io"
+	"net/http"
+	"strings"
+	"testing"
+)
+
+func TestScanSSELinesHandlesLongSingleLine(t *testing.T) {
+	payload := strings.Repeat("x", 2*1024*1024+4096)
+	body := "data: {\"p\":\"response/content\",\"v\":\"" + payload + "\"}\n"
+	resp := &http.Response{Body: io.NopCloser(strings.NewReader(body))}
+
+	var got string
+	err := ScanSSELines(resp, func(line []byte) bool {
+		got = string(line)
+		return true
+	})
+	if err != nil {
+		t.Fatalf("ScanSSELines returned error: %v", err)
+	}
+	if !strings.Contains(got, payload) {
+		t.Fatalf("long SSE line was not preserved: got len=%d want payload len=%d", len(got), len(payload))
+	}
+}
--- a/internal/devcapture/store.go
+++ b/internal/devcapture/store.go
@@ -10,6 +10,8 @@ import (
 	"sync"
 	"time"

+	"ds2api/internal/util"
+
 	"github.com/google/uuid"
 )

@@ -194,7 +196,8 @@ func (c *captureBody) append(chunk string) {
 	}
 	remain := maxLen - current
 	if len(chunk) > remain {
-		c.buf.WriteString(chunk[:remain])
+		truncated, _ := util.TruncateUTF8Bytes(chunk, remain)
+		c.buf.WriteString(truncated)
 		c.truncated = true
 		return
 	}
--- a/internal/devcapture/store_test.go
+++ b/internal/devcapture/store_test.go
@@ -4,6 +4,7 @@ import (
 	"io"
 	"strings"
 	"testing"
+	"unicode/utf8"
 )

 func TestNewFromEnvDefaults(t *testing.T) {
@@ -82,3 +83,28 @@ func TestWrapBodyTruncatesByLimit(t *testing.T) {
 		t.Fatalf("expected account id, got %q", items[0].AccountID)
 	}
 }
+
+func TestWrapBodyTruncatesUTF8WithoutBreakingRune(t *testing.T) {
+	s := &Store{enabled: true, limit: 5, maxBodyBytes: 5}
+	session := s.Start("test", "http://x", "acc1", map[string]any{"x": 1})
+	if session == nil {
+		t.Fatal("expected session")
+	}
+	rc := session.WrapBody(io.NopCloser(strings.NewReader("😀xy")), 200)
+	_, _ = io.ReadAll(rc)
+	_ = rc.Close()
+
+	items := s.Snapshot()
+	if len(items) != 1 {
+		t.Fatalf("expected 1 item, got %d", len(items))
+	}
+	if !utf8.ValidString(items[0].ResponseBody) {
+		t.Fatalf("expected valid utf-8 response body, got %q", items[0].ResponseBody)
+	}
+	if items[0].ResponseBody != "😀x" {
+		t.Fatalf("expected rune-safe truncation, got %q", items[0].ResponseBody)
+	}
+	if !items[0].ResponseTruncated {
+		t.Fatal("expected truncated flag true")
+	}
+}
--- a/internal/format/claude/render.go
+++ b/internal/format/claude/render.go
@@ -5,6 +5,7 @@ import (
 	"fmt"
 	"time"

+	"ds2api/internal/prompt"
 	"ds2api/internal/util"
 )

@@ -43,8 +44,23 @@ func BuildMessageResponse(messageID, model string, normalizedMessages []any, fin
 		"stop_reason":   stopReason,
 		"stop_sequence": nil,
 		"usage": map[string]any{
-			"input_tokens":  util.EstimateTokens(fmt.Sprintf("%v", normalizedMessages)),
-			"output_tokens": util.EstimateTokens(finalThinking) + util.EstimateTokens(finalText),
+			"input_tokens":  util.CountPromptTokens(prompt.MessagesPrepareWithThinking(claudeMessageMaps(normalizedMessages), false), model),
+			"output_tokens": util.CountOutputTokens(finalThinking, model) + util.CountOutputTokens(finalText, model),
 		},
 	}
 }
+
+func claudeMessageMaps(messages []any) []map[string]any {
+	if len(messages) == 0 {
+		return nil
+	}
+	out := make([]map[string]any, 0, len(messages))
+	for _, item := range messages {
+		msg, ok := item.(map[string]any)
+		if !ok {
+			continue
+		}
+		out = append(out, msg)
+	}
+	return out
+}
--- a/internal/format/openai/render_chat.go
+++ b/internal/format/openai/render_chat.go
@@ -6,12 +6,12 @@ import (
 	"time"
 )

-func BuildChatCompletion(completionID, model, finalPrompt, finalThinking, finalText string, toolNames []string) map[string]any {
+func BuildChatCompletion(completionID, model, finalPrompt, finalThinking, finalText string, toolNames []string, toolsRaw any) map[string]any {
 	detected := toolcall.ParseAssistantToolCallsDetailed(finalText, finalThinking, toolNames)
-	return BuildChatCompletionWithToolCalls(completionID, model, finalPrompt, finalThinking, finalText, detected.Calls)
+	return BuildChatCompletionWithToolCalls(completionID, model, finalPrompt, finalThinking, finalText, detected.Calls, toolsRaw)
 }

-func BuildChatCompletionWithToolCalls(completionID, model, finalPrompt, finalThinking, finalText string, detected []toolcall.ParsedToolCall) map[string]any {
+func BuildChatCompletionWithToolCalls(completionID, model, finalPrompt, finalThinking, finalText string, detected []toolcall.ParsedToolCall, toolsRaw any) map[string]any {
 	finishReason := "stop"
 	messageObj := map[string]any{"role": "assistant", "content": finalText}
 	if strings.TrimSpace(finalThinking) != "" {
@@ -19,7 +19,7 @@ func BuildChatCompletionWithToolCalls(completionID, model, finalPrompt, finalThi
 	}
 	if len(detected) > 0 {
 		finishReason = "tool_calls"
-		messageObj["tool_calls"] = toolcall.FormatOpenAIToolCalls(detected)
+		messageObj["tool_calls"] = toolcall.FormatOpenAIToolCalls(detected, toolsRaw)
 		messageObj["content"] = nil
 	}

@@ -29,7 +29,7 @@ func BuildChatCompletionWithToolCalls(completionID, model, finalPrompt, finalThi
 		"created": time.Now().Unix(),
 		"model":   model,
 		"choices": []map[string]any{{"index": 0, "message": messageObj, "finish_reason": finishReason}},
-		"usage":   BuildChatUsage(finalPrompt, finalThinking, finalText),
+		"usage":   BuildChatUsageForModel(model, finalPrompt, finalThinking, finalText, 0),
 	}
 }

--- a/internal/format/openai/render_responses.go
+++ b/internal/format/openai/render_responses.go
@@ -9,19 +9,19 @@ import (
 	"github.com/google/uuid"
 )

-func BuildResponseObject(responseID, model, finalPrompt, finalThinking, finalText string, toolNames []string) map[string]any {
+func BuildResponseObject(responseID, model, finalPrompt, finalThinking, finalText string, toolNames []string, toolsRaw any) map[string]any {
 	// Strict mode: only standalone, structured tool-call payloads are treated
 	// as executable tool calls.
 	detected := toolcall.ParseAssistantToolCallsDetailed(finalText, finalThinking, toolNames)
-	return BuildResponseObjectWithToolCalls(responseID, model, finalPrompt, finalThinking, finalText, detected.Calls)
+	return BuildResponseObjectWithToolCalls(responseID, model, finalPrompt, finalThinking, finalText, detected.Calls, toolsRaw)
 }

-func BuildResponseObjectWithToolCalls(responseID, model, finalPrompt, finalThinking, finalText string, detected []toolcall.ParsedToolCall) map[string]any {
+func BuildResponseObjectWithToolCalls(responseID, model, finalPrompt, finalThinking, finalText string, detected []toolcall.ParsedToolCall, toolsRaw any) map[string]any {
 	exposedOutputText := finalText
 	output := make([]any, 0, 2)
 	if len(detected) > 0 {
 		exposedOutputText = ""
-		output = append(output, toResponsesFunctionCallItems(detected)...)
+		output = append(output, toResponsesFunctionCallItems(detected, toolsRaw)...)
 	} else {
 		content := make([]any, 0, 2)
 		if finalThinking != "" {
@@ -70,16 +70,17 @@ func BuildResponseObjectFromItems(responseID, model, finalPrompt, finalThinking,
 		"model":       model,
 		"output":      output,
 		"output_text": outputText,
-		"usage":       BuildResponsesUsage(finalPrompt, finalThinking, finalText),
+		"usage":       BuildResponsesUsageForModel(model, finalPrompt, finalThinking, finalText, 0),
 	}
 }

-func toResponsesFunctionCallItems(toolCalls []toolcall.ParsedToolCall) []any {
+func toResponsesFunctionCallItems(toolCalls []toolcall.ParsedToolCall, toolsRaw any) []any {
 	if len(toolCalls) == 0 {
 		return nil
 	}
+	normalizedCalls := toolcall.NormalizeParsedToolCallsForSchemas(toolCalls, toolsRaw)
 	out := make([]any, 0, len(toolCalls))
-	for _, tc := range toolCalls {
+	for _, tc := range normalizedCalls {
 		if strings.TrimSpace(tc.Name) == "" {
 			continue
 		}
--- a/internal/format/openai/render_test.go
+++ b/internal/format/openai/render_test.go
@@ -1,8 +1,12 @@
 package openai

 import (
+	"encoding/json"
 	"strings"
 	"testing"
+
+	"ds2api/internal/toolcall"
+	"ds2api/internal/util"
 )

 func TestBuildResponseObjectKeepsFencedToolPayloadAsText(t *testing.T) {
@@ -13,6 +17,7 @@ func TestBuildResponseObjectKeepsFencedToolPayloadAsText(t *testing.T) {
 		"",
 		"```json\n{\"tool_calls\":[{\"name\":\"search\",\"input\":{\"q\":\"golang\"}}]}\n```",
 		[]string{"search"},
+		nil,
 	)

 	outputText, _ := obj["output_text"].(string)
@@ -42,6 +47,7 @@ func TestBuildResponseObjectReasoningOnlyFallsBackToOutputText(t *testing.T) {
 		"internal thinking content",
 		"",
 		nil,
+		nil,
 	)

 	outputText, _ := obj["output_text"].(string)
@@ -75,6 +81,7 @@ func TestBuildResponseObjectPromotesToolCallFromThinkingWhenTextEmpty(t *testing
 		`<tool_calls><invoke name="search"><parameter name="q">from-thinking</parameter></invoke></tool_calls>`,
 		"",
 		[]string{"search"},
+		nil,
 	)

 	output, _ := obj["output"].([]any)
@@ -86,3 +93,102 @@ func TestBuildResponseObjectPromotesToolCallFromThinkingWhenTextEmpty(t *testing
 		t.Fatalf("expected function_call output, got %#v", first["type"])
 	}
 }
+
+func TestBuildChatCompletionWithToolCallsCoercesSchemaDeclaredStringArguments(t *testing.T) {
+	toolsRaw := []any{
+		map[string]any{
+			"type": "function",
+			"function": map[string]any{
+				"name": "Write",
+				"parameters": map[string]any{
+					"type": "object",
+					"properties": map[string]any{
+						"content": map[string]any{"type": "string"},
+						"taskId":  map[string]any{"type": "string"},
+					},
+				},
+			},
+		},
+	}
+	obj := BuildChatCompletionWithToolCalls(
+		"chat_test",
+		"gpt-4o",
+		"prompt",
+		"",
+		"",
+		[]toolcall.ParsedToolCall{{
+			Name: "Write",
+			Input: map[string]any{
+				"content": map[string]any{"message": "hi"},
+				"taskId":  1,
+			},
+		}},
+		toolsRaw,
+	)
+	choices, _ := obj["choices"].([]map[string]any)
+	message, _ := choices[0]["message"].(map[string]any)
+	toolCalls, _ := message["tool_calls"].([]map[string]any)
+	fn, _ := toolCalls[0]["function"].(map[string]any)
+	args := map[string]any{}
+	if err := json.Unmarshal([]byte(fn["arguments"].(string)), &args); err != nil {
+		t.Fatalf("decode arguments failed: %v", err)
+	}
+	if args["content"] != `{"message":"hi"}` {
+		t.Fatalf("expected content stringified by schema, got %#v", args["content"])
+	}
+	if args["taskId"] != "1" {
+		t.Fatalf("expected taskId stringified by schema, got %#v", args["taskId"])
+	}
+}
+
+func TestBuildResponseObjectWithToolCallsCoercesSchemaDeclaredStringArguments(t *testing.T) {
+	toolsRaw := []any{
+		map[string]any{
+			"type": "function",
+			"function": map[string]any{
+				"name": "Write",
+				"parameters": map[string]any{
+					"type": "object",
+					"properties": map[string]any{
+						"content": map[string]any{"type": "string"},
+					},
+				},
+			},
+		},
+	}
+	obj := BuildResponseObjectWithToolCalls(
+		"resp_test",
+		"gpt-4o",
+		"prompt",
+		"",
+		"",
+		[]toolcall.ParsedToolCall{{
+			Name:  "Write",
+			Input: map[string]any{"content": []any{"a", 1}},
+		}},
+		toolsRaw,
+	)
+	output, _ := obj["output"].([]any)
+	first, _ := output[0].(map[string]any)
+	args := map[string]any{}
+	if err := json.Unmarshal([]byte(first["arguments"].(string)), &args); err != nil {
+		t.Fatalf("decode response arguments failed: %v", err)
+	}
+	if args["content"] != `["a",1]` {
+		t.Fatalf("expected response content stringified by schema, got %#v", args["content"])
+	}
+}
+
+func TestBuildChatUsageForModelUsesConservativePromptCount(t *testing.T) {
+	prompt := strings.Repeat("上下文token ", 40)
+	usage := BuildChatUsageForModel("deepseek-v4-flash", prompt, "", "ok", 0)
+	promptTokens, _ := usage["prompt_tokens"].(int)
+	if promptTokens <= util.EstimateTokens(prompt) {
+		t.Fatalf("expected conservative prompt token count > rough estimate, got=%d estimate=%d", promptTokens, util.EstimateTokens(prompt))
+	}
+	totalTokens, _ := usage["total_tokens"].(int)
+	completionTokens, _ := usage["completion_tokens"].(int)
+	if totalTokens != promptTokens+completionTokens {
+		t.Fatalf("expected total tokens to add up, got usage=%#v", usage)
+	}
+}
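Reviewer note: the two coercion tests above pin down the observable behavior (a schema property declared as `"type": "string"` receives the JSON encoding of any non-string value), but `toolcall.NormalizeParsedToolCallsForSchemas` itself is not part of this excerpt. The following is a minimal sketch of that idea under stated assumptions, not the project's implementation; `coerceStringArgs` and its flat `properties` lookup are hypothetical.

```go
package main

import "encoding/json"

// coerceStringArgs is a hypothetical helper, not the project's API.
// For every argument whose schema property declares "type": "string",
// a non-string value is replaced by its compact JSON encoding.
// Arguments without a matching string property are copied unchanged.
func coerceStringArgs(input map[string]any, schema map[string]any) map[string]any {
	props, _ := schema["properties"].(map[string]any)
	out := make(map[string]any, len(input))
	for key, value := range input {
		out[key] = value
		prop, _ := props[key].(map[string]any)
		if prop["type"] != "string" || value == nil {
			continue
		}
		if _, isString := value.(string); isString {
			continue
		}
		// json.Marshal renders 1 as `1` and nested objects or arrays as
		// compact JSON text, matching the expectations in the tests above.
		if encoded, err := json.Marshal(value); err == nil {
			out[key] = string(encoded)
		}
	}
	return out
}
```

Under this sketch, `{"content": {"message": "hi"}, "taskId": 1}` with both properties declared as strings becomes `{"content": "{\"message\":\"hi\"}", "taskId": "1"}`, the same shape the tests assert.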
--- a/internal/format/openai/render_usage.go
+++ b/internal/format/openai/render_usage.go
@@ -2,10 +2,10 @@ package openai

 import "ds2api/internal/util"

-func BuildChatUsage(finalPrompt, finalThinking, finalText string) map[string]any {
-	promptTokens := util.EstimateTokens(finalPrompt)
-	reasoningTokens := util.EstimateTokens(finalThinking)
-	completionTokens := util.EstimateTokens(finalText)
+func BuildChatUsageForModel(model, finalPrompt, finalThinking, finalText string, refFileTokens int) map[string]any {
+	promptTokens := util.CountPromptTokens(finalPrompt, model) + refFileTokens
+	reasoningTokens := util.CountOutputTokens(finalThinking, model)
+	completionTokens := util.CountOutputTokens(finalText, model)
 	return map[string]any{
 		"prompt_tokens":     promptTokens,
 		"completion_tokens": reasoningTokens + completionTokens,
@@ -16,13 +16,21 @@ func BuildChatUsage(finalPrompt, finalThinking, finalText string) map[string]any
 	}
 }

-func BuildResponsesUsage(finalPrompt, finalThinking, finalText string) map[string]any {
-	promptTokens := util.EstimateTokens(finalPrompt)
-	reasoningTokens := util.EstimateTokens(finalThinking)
-	completionTokens := util.EstimateTokens(finalText)
+func BuildChatUsage(finalPrompt, finalThinking, finalText string) map[string]any {
+	return BuildChatUsageForModel("", finalPrompt, finalThinking, finalText, 0)
+}
+
+func BuildResponsesUsageForModel(model, finalPrompt, finalThinking, finalText string, refFileTokens int) map[string]any {
+	promptTokens := util.CountPromptTokens(finalPrompt, model) + refFileTokens
+	reasoningTokens := util.CountOutputTokens(finalThinking, model)
+	completionTokens := util.CountOutputTokens(finalText, model)
 	return map[string]any{
 		"input_tokens":  promptTokens,
 		"output_tokens": reasoningTokens + completionTokens,
 		"total_tokens":  promptTokens + reasoningTokens + completionTokens,
 	}
 }
+
+func BuildResponsesUsage(finalPrompt, finalThinking, finalText string) map[string]any {
+	return BuildResponsesUsageForModel("", finalPrompt, finalThinking, finalText, 0)
+}
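Reviewer note: the usage builders above keep the old signatures as thin wrappers that pass an empty model and zero reference-file tokens. A small sketch of the arithmetic may help when reviewing the totals. It assumes a pluggable counter in place of `util.CountPromptTokens` and `util.CountOutputTokens` (neither is shown in this excerpt); `tokenCounter`, `buildUsage`, and `buildUsageLegacy` are hypothetical names.

```go
package main

// tokenCounter is a hypothetical stand-in for the util counting functions.
type tokenCounter func(text, model string) int

// buildUsage mirrors the arithmetic in the diff: reference-file tokens are
// added to the prompt side, and reasoning tokens are billed as completion.
func buildUsage(count tokenCounter, model, prompt, thinking, text string, refFileTokens int) map[string]int {
	promptTokens := count(prompt, model) + refFileTokens
	reasoningTokens := count(thinking, model)
	completionTokens := count(text, model)
	return map[string]int{
		"prompt_tokens":     promptTokens,
		"completion_tokens": reasoningTokens + completionTokens,
		"total_tokens":      promptTokens + reasoningTokens + completionTokens,
	}
}

// buildUsageLegacy shows the backward-compatible wrapper pattern: callers
// that do not know the model or reference-file size get neutral defaults.
func buildUsageLegacy(count tokenCounter, prompt, thinking, text string) map[string]int {
	return buildUsage(count, "", prompt, thinking, text, 0)
}
```

The invariant the new test checks, `total_tokens == prompt_tokens + completion_tokens`, holds by construction in this shape.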
--- a/internal/httpapi/admin/accounts/handler_accounts_testing.go
+++ b/internal/httpapi/admin/accounts/handler_accounts_testing.go
@@ -107,6 +107,7 @@ func (h *Handler) testAccount(ctx context.Context, acc config.Account, model, me
 		"model":           model,
 		"session_count":   0,
 		"config_writable": !h.Store.IsEnvBacked(),
+		"config_warning":  "",
 	}
 	defer func() {
 		status := "failed"
@@ -121,8 +122,7 @@ func (h *Handler) testAccount(ctx context.Context, acc config.Account, model, me
 		return result
 	}
 	if err := h.Store.UpdateAccountToken(acc.Identifier(), token); err != nil {
-		result["message"] = "登录成功但写入运行时 token 失败: " + err.Error()
-		return result
+		result["config_warning"] = "登录成功，但 token 持久化失败（仅保存在内存，重启后会丢失）: " + err.Error() // "Login succeeded, but persisting the token failed (kept in memory only; lost on restart)"
 	}
 	authCtx := &authn.RequestAuth{UseConfigToken: false, DeepSeekToken: token, AccountID: identifier, Account: acc}
 	proxyCtx := authn.WithAuth(ctx, authCtx)
@@ -136,8 +136,7 @@ func (h *Handler) testAccount(ctx context.Context, acc config.Account, model, me
 		token = newToken
 		authCtx.DeepSeekToken = token
 		if err := h.Store.UpdateAccountToken(acc.Identifier(), token); err != nil {
-			result["message"] = "刷新 token 成功但写入运行时 token 失败: " + err.Error()
-			return result
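+			result["config_warning"] = "刷新 token 成功，但 token 持久化失败（仅保存在内存，重启后会丢失）: " + err.Error() // "Token refresh succeeded, but persisting the token failed (kept in memory only; lost on restart)"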
 		}
 		sessionID, err = h.DS.CreateSession(proxyCtx, authCtx, 1)
 		if err != nil {
@@ -155,6 +154,9 @@ func (h *Handler) testAccount(ctx context.Context, acc config.Account, model, me
 	if strings.TrimSpace(message) == "" {
 		result["success"] = true
 		result["message"] = "Token 刷新成功（登录与会话创建成功）"
+		if warning, _ := result["config_warning"].(string); strings.TrimSpace(warning) != "" {
+			result["message"] = result["message"].(string) + "；" + warning
+		}
 		result["response_time"] = int(time.Since(start).Milliseconds())
 		return result
 	}
--- a/internal/httpapi/admin/rawsamples/handler_raw_samples.go
+++ b/internal/httpapi/admin/rawsamples/handler_raw_samples.go
@@ -15,6 +15,7 @@ import (
 	"ds2api/internal/devcapture"
 	adminshared "ds2api/internal/httpapi/admin/shared"
 	"ds2api/internal/rawsample"
+	"ds2api/internal/util"
 )

 type captureChain struct {
@@ -479,10 +480,13 @@ func previewCaptureChainResponse(chain captureChain) string {

 func previewText(text string, limit int) string {
 	text = strings.TrimSpace(text)
-	if limit <= 0 || len(text) <= limit {
+	if limit <= 0 {
 		return text
 	}
-	return text[:limit] + "..."
+	if truncated, ok := util.TruncateRunes(text, limit); ok {
+		return truncated + "..."
+	}
+	return text
 }

 func captureChainHasTruncatedResponse(chain captureChain) bool {
--- a/internal/httpapi/admin/rawsamples/handler_raw_samples_test.go
+++ b/internal/httpapi/admin/rawsamples/handler_raw_samples_test.go
@@ -10,6 +10,7 @@ import (
 	"path/filepath"
 	"strings"
 	"testing"
+	"unicode/utf8"

 	"ds2api/internal/devcapture"
 )
@@ -231,6 +232,16 @@ func TestCombineCaptureBodiesPreservesOrderAndSeparators(t *testing.T) {
 	}
 }

+func TestPreviewTextPreservesUTF8MB4Characters(t *testing.T) {
+	preview := previewText(strings.Repeat("😀", 281), 280)
+	if !utf8.ValidString(preview) {
+		t.Fatalf("expected valid utf-8 preview, got %q", preview)
+	}
+	if preview != strings.Repeat("😀", 280)+"..." {
+		t.Fatalf("unexpected preview: %q", preview)
+	}
+}
+
 func TestQueryRawSampleCapturesGroupsBySessionAndMatchesQuestion(t *testing.T) {
 	devcapture.Global().Clear()
 	defer devcapture.Global().Clear()
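Reviewer note: the old `text[:limit]` sliced by bytes and could cut a multi-byte character in half, which is what the new test guards against. `util.TruncateRunes` is not shown in this excerpt; the sketch below assumes it returns the prefix holding at most `limit` runes plus a flag saying whether anything was cut. `truncateRunes` is a hypothetical reimplementation for illustration only.

```go
package main

import "strings"

// truncateRunes is a hypothetical stand-in for util.TruncateRunes.
// It returns s cut to at most limit runes and true when s was longer;
// otherwise it returns s unchanged and false.
func truncateRunes(s string, limit int) (string, bool) {
	if limit <= 0 {
		return s, false
	}
	seen := 0
	for i := range s { // i is the byte offset at which each rune starts
		if seen == limit {
			return s[:i], true
		}
		seen++
	}
	return s, false
}

// previewText follows the shape of the patched function: trim, then
// truncate on rune boundaries and append an ellipsis only when cut.
func previewText(text string, limit int) string {
	text = strings.TrimSpace(text)
	if limit <= 0 {
		return text
	}
	if truncated, ok := truncateRunes(text, limit); ok {
		return truncated + "..."
	}
	return text
}
```

Note that under this reading `limit` counts runes rather than bytes, which agrees with the test expecting 280 emoji followed by `...`.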
--- a/internal/httpapi/claude/handler_helpers_misc.go
+++ b/internal/httpapi/claude/handler_helpers_misc.go
@@ -1,6 +1,8 @@
 package claude

 import (
 	"fmt"
 	"strings"
+
+	"ds2api/internal/toolcall"
 )
@@ -31,30 +32,9 @@ func extractClaudeToolNames(tools []any) []string {
 }

 func extractClaudeToolMeta(m map[string]any) (string, string, any) {
-	name, _ := m["name"].(string)
-	desc, _ := m["description"].(string)
-	schemaObj := m["input_schema"]
-	if schemaObj == nil {
-		schemaObj = m["parameters"]
-	}
-
-	if fn, ok := m["function"].(map[string]any); ok {
-		if strings.TrimSpace(name) == "" {
-			name, _ = fn["name"].(string)
-		}
-		if strings.TrimSpace(desc) == "" {
-			desc, _ = fn["description"].(string)
-		}
-		if schemaObj == nil {
-			if v, ok := fn["input_schema"]; ok {
-				schemaObj = v
-			}
-		}
-		if schemaObj == nil {
-			if v, ok := fn["parameters"]; ok {
-				schemaObj = v
-			}
-		}
+	name, desc, schemaObj := toolcall.ExtractToolMeta(m)
+	if strings.TrimSpace(desc) == "" {
+		desc = "No description available"
 	}
 	return strings.TrimSpace(name), strings.TrimSpace(desc), schemaObj
 }
--- a/internal/httpapi/claude/handler_messages.go
+++ b/internal/httpapi/claude/handler_messages.go
@@ -3,12 +3,14 @@ package claude
 import (
 	"bytes"
 	"encoding/json"
+	"errors"
 	"io"
 	"net/http"
 	"net/http/httptest"
 	"strings"

 	"ds2api/internal/config"
+	"ds2api/internal/httpapi/requestbody"
 	streamengine "ds2api/internal/stream"
 	"ds2api/internal/translatorcliproxy"
 	"ds2api/internal/util"
@@ -33,7 +35,11 @@ func (h *Handler) Messages(w http.ResponseWriter, r *http.Request) {
 func (h *Handler) proxyViaOpenAI(w http.ResponseWriter, r *http.Request, store ConfigReader) bool {
 	raw, err := io.ReadAll(r.Body)
 	if err != nil {
-		writeClaudeError(w, http.StatusBadRequest, "invalid body")
+		if errors.Is(err, requestbody.ErrInvalidUTF8Body) {
+			writeClaudeError(w, http.StatusBadRequest, "invalid json")
+		} else {
+			writeClaudeError(w, http.StatusBadRequest, "invalid body")
+		}
 		return true
 	}
 	var req map[string]any
@@ -177,7 +183,7 @@ func stripClaudeThinkingBlocks(raw []byte) []byte {
 	return out
 }

-func (h *Handler) handleClaudeStreamRealtime(w http.ResponseWriter, r *http.Request, resp *http.Response, model string, messages []any, thinkingEnabled, searchEnabled bool, toolNames []string) {
+func (h *Handler) handleClaudeStreamRealtime(w http.ResponseWriter, r *http.Request, resp *http.Response, model string, messages []any, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any) {
 	defer func() { _ = resp.Body.Close() }()
 	if resp.StatusCode != http.StatusOK {
 		body, _ := io.ReadAll(resp.Body)
@@ -205,6 +211,8 @@ func (h *Handler) handleClaudeStreamRealtime(w http.ResponseWriter, r *http.Requ
 		searchEnabled,
 		h.compatStripReferenceMarkers(),
 		toolNames,
+		toolsRaw,
+		buildClaudePromptTokenText(messages, thinkingEnabled),
 	)
 	streamRuntime.sendMessageStart()

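Reviewer note: the new branch in `proxyViaOpenAI` maps `requestbody.ErrInvalidUTF8Body` to `invalid json` and every other read failure to `invalid body`. The `requestbody` package is not part of this excerpt, and the diff suggests the body reader itself surfaces the sentinel. The sketch below validates the whole buffer instead and only illustrates the sentinel-plus-`errors.Is` pattern; `errInvalidUTF8Body`, `validateBody`, and `errorMessageFor` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// errInvalidUTF8Body is a hypothetical sentinel standing in for
// requestbody.ErrInvalidUTF8Body.
var errInvalidUTF8Body = errors.New("request body is not valid UTF-8")

// validateBody wraps the sentinel so callers can still match it with
// errors.Is after additional context has been added.
func validateBody(raw []byte) error {
	if !utf8.Valid(raw) {
		return fmt.Errorf("read body: %w", errInvalidUTF8Body)
	}
	return nil
}

// errorMessageFor picks the client-facing message for a read failure,
// mirroring the branch added in the handlers.
func errorMessageFor(err error) string {
	if errors.Is(err, errInvalidUTF8Body) {
		return "invalid json"
	}
	return "invalid body"
}
```

Matching with `errors.Is` rather than `==` keeps the branch correct even if a layer in between wraps the error.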
--- a/internal/httpapi/claude/handler_stream_test.go
+++ b/internal/httpapi/claude/handler_stream_test.go
@@ -81,7 +81,7 @@ func TestHandleClaudeStreamRealtimeTextIncrementsWithEventHeaders(t *testing.T)
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/anthropic/v1/messages", nil)

-	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "hi"}}, false, false, nil)
+	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "hi"}}, false, false, nil, nil)

 	body := rec.Body.String()
 	if !strings.Contains(body, "event: message_start") {
@@ -122,7 +122,7 @@ func TestHandleClaudeStreamRealtimeThinkingDelta(t *testing.T) {
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/anthropic/v1/messages", nil)

-	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "hi"}}, true, false, nil)
+	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "hi"}}, true, false, nil, nil)

 	frames := parseClaudeFrames(t, rec.Body.String())
 	foundThinkingDelta := false
@@ -149,7 +149,7 @@ func TestHandleClaudeStreamRealtimeSkipsThinkingFallbackWhenFinalTextExists(t *t
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/anthropic/v1/messages", nil)

-	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "use tool"}}, true, false, []string{"search"})
+	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "use tool"}}, true, false, []string{"search"}, nil)

 	frames := parseClaudeFrames(t, rec.Body.String())
 	for _, f := range findClaudeFrames(frames, "content_block_start") {
@@ -180,7 +180,7 @@ func TestHandleClaudeStreamRealtimeUpstreamErrorEvent(t *testing.T) {
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/anthropic/v1/messages", nil)

-	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "hi"}}, false, false, nil)
+	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "hi"}}, false, false, nil, nil)

 	frames := parseClaudeFrames(t, rec.Body.String())
 	errFrames := findClaudeFrames(frames, "error")
@@ -217,7 +217,7 @@ func TestHandleClaudeStreamRealtimePingEvent(t *testing.T) {

 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/anthropic/v1/messages", nil)
-	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "hi"}}, false, false, nil)
+	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "hi"}}, false, false, nil, nil)

 	frames := parseClaudeFrames(t, rec.Body.String())
 	if len(findClaudeFrames(frames, "ping")) == 0 {
@@ -271,7 +271,7 @@ func TestHandleClaudeStreamRealtimeToolSafetyAcrossStructuredFormats(t *testing.
 			rec := httptest.NewRecorder()
 			req := httptest.NewRequest(http.MethodPost, "/anthropic/v1/messages", nil)

-			h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "use tool"}}, false, false, []string{"Bash"})
+			h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "use tool"}}, false, false, []string{"Bash"}, nil)

 			frames := parseClaudeFrames(t, rec.Body.String())
 			foundToolUse := false
@@ -299,7 +299,7 @@ func TestHandleClaudeStreamRealtimeDetectsToolUseWithLeadingProse(t *testing.T)
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/anthropic/v1/messages", nil)

-	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "use tool"}}, false, false, []string{"write_file"})
+	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "use tool"}}, false, false, []string{"write_file"}, nil)

 	frames := parseClaudeFrames(t, rec.Body.String())
 	foundToolUse := false
@@ -333,7 +333,7 @@ func TestHandleClaudeStreamRealtimeIgnoresUnclosedFencedToolExample(t *testing.T
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/anthropic/v1/messages", nil)

-	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "show example only"}}, false, false, []string{"Bash"})
+	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "show example only"}}, false, false, []string{"Bash"}, nil)

 	frames := parseClaudeFrames(t, rec.Body.String())
 	foundToolUse := false
@@ -365,3 +365,48 @@ func TestHandleClaudeStreamRealtimeIgnoresUnclosedFencedToolExample(t *testing.T
 func TestHandleClaudeStreamRealtimePromotesUnclosedFencedToolExample(t *testing.T) {
 	TestHandleClaudeStreamRealtimeIgnoresUnclosedFencedToolExample(t)
 }
+
+func TestHandleClaudeStreamRealtimeNormalizesToolInputBySchema(t *testing.T) {
+	h := &Handler{}
+	resp := makeClaudeSSEHTTPResponse(
+		`data: {"p":"response/content","v":"<tool_calls><invoke name=\"Write\">{\"input\":{\"content\":{\"message\":\"hi\"},\"taskId\":1}}</invoke></tool_calls>"}`,
+		`data: [DONE]`,
+	)
+	rec := httptest.NewRecorder()
+	req := httptest.NewRequest(http.MethodPost, "/anthropic/v1/messages", nil)
+	toolsRaw := []any{
+		map[string]any{
+			"name": "Write",
+			"inputSchema": map[string]any{
+				"type": "object",
+				"properties": map[string]any{
+					"content": map[string]any{"type": "string"},
+					"taskId":  map[string]any{"type": "string"},
+				},
+			},
+		},
+	}
+
+	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "write"}}, false, false, []string{"Write"}, toolsRaw)
+
+	frames := parseClaudeFrames(t, rec.Body.String())
+	for _, f := range findClaudeFrames(frames, "content_block_delta") {
+		delta, _ := f.Payload["delta"].(map[string]any)
+		if delta["type"] != "input_json_delta" {
+			continue
+		}
+		partial := asString(delta["partial_json"])
+		var args map[string]any
+		if err := json.Unmarshal([]byte(partial), &args); err != nil {
+			t.Fatalf("decode partial_json failed: %v payload=%s", err, partial)
+		}
+		if args["content"] != `{"message":"hi"}` {
+			t.Fatalf("expected content normalized to string, got %#v", args["content"])
+		}
+		if args["taskId"] != "1" {
+			t.Fatalf("expected taskId normalized to string, got %#v", args["taskId"])
+		}
+		return
+	}
+	t.Fatalf("expected input_json_delta frame, body=%s", rec.Body.String())
+}
--- a/internal/httpapi/claude/handler_tokens.go
+++ b/internal/httpapi/claude/handler_tokens.go
@@ -3,8 +3,6 @@ package claude
 import (
 	"encoding/json"
 	"net/http"
-
-	"ds2api/internal/util"
 )

 func (h *Handler) CountTokens(w http.ResponseWriter, r *http.Request) {
@@ -26,26 +24,11 @@ func (h *Handler) CountTokens(w http.ResponseWriter, r *http.Request) {
 		writeClaudeError(w, http.StatusBadRequest, "Request must include 'model' and 'messages'.")
 		return
 	}
-	inputTokens := 0
-	if sys, ok := req["system"].(string); ok {
-		inputTokens += util.EstimateTokens(sys)
-	}
-	for _, item := range messages {
-		msg, ok := item.(map[string]any)
-		if !ok {
-			continue
-		}
-		inputTokens += 2
-		inputTokens += util.EstimateTokens(extractMessageContent(msg["content"]))
-	}
-	if tools, ok := req["tools"].([]any); ok {
-		for _, t := range tools {
-			b, _ := json.Marshal(t)
-			inputTokens += util.EstimateTokens(string(b))
-		}
-	}
-	if inputTokens < 1 {
-		inputTokens = 1
+	normalized, err := normalizeClaudeRequest(h.Store, req)
+	if err != nil {
+		writeClaudeError(w, http.StatusBadRequest, err.Error())
+		return
 	}
+	inputTokens := countClaudeInputTokens(normalized.Standard)
 	writeJSON(w, http.StatusOK, map[string]any{"input_tokens": inputTokens})
 }
--- a/internal/httpapi/claude/prompt_token_text.go
+++ b/internal/httpapi/claude/prompt_token_text.go
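@@ -0,0 +1,9 @@
+package claude
+
+import "ds2api/internal/prompt"
+
+// buildClaudePromptTokenText renders messages into the prompt text that the
+// stream runtime uses for input-token counting.
+func buildClaudePromptTokenText(messages []any, thinkingEnabled bool) string {
+	return prompt.MessagesPrepareWithThinking(toMessageMaps(messages), thinkingEnabled)
+}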
--- a/internal/httpapi/claude/standard_request.go
+++ b/internal/httpapi/claude/standard_request.go
@@ -48,16 +48,18 @@ func normalizeClaudeRequest(store ConfigReader, req map[string]any) (claudeNorma

 	return claudeNormalizedRequest{
 		Standard: promptcompat.StandardRequest{
-			Surface:        "anthropic_messages",
-			RequestedModel: strings.TrimSpace(model),
-			ResolvedModel:  dsModel,
-			ResponseModel:  strings.TrimSpace(model),
-			Messages:       payload["messages"].([]any),
-			FinalPrompt:    finalPrompt,
-			ToolNames:      toolNames,
-			Stream:         util.ToBool(req["stream"]),
-			Thinking:       thinkingEnabled,
-			Search:         searchEnabled,
+			Surface:         "anthropic_messages",
+			RequestedModel:  strings.TrimSpace(model),
+			ResolvedModel:   dsModel,
+			ResponseModel:   strings.TrimSpace(model),
+			Messages:        payload["messages"].([]any),
+			PromptTokenText: finalPrompt,
+			ToolsRaw:        toolsRequested,
+			FinalPrompt:     finalPrompt,
+			ToolNames:       toolNames,
+			Stream:          util.ToBool(req["stream"]),
+			Thinking:        thinkingEnabled,
+			Search:          searchEnabled,
 		},
 		NormalizedMessages: normalizedMessages,
 	}, nil
--- a/internal/httpapi/claude/standard_request_test.go
+++ b/internal/httpapi/claude/standard_request_test.go
@@ -32,11 +32,39 @@ func TestNormalizeClaudeRequest(t *testing.T) {
 	if len(norm.Standard.ToolNames) == 0 {
 		t.Fatalf("expected tool names")
 	}
+	if norm.Standard.ToolsRaw == nil {
+		t.Fatalf("expected ToolsRaw preserved for downstream normalization")
+	}
 	if norm.Standard.FinalPrompt == "" {
 		t.Fatalf("expected non-empty final prompt")
 	}
 }

+func TestNormalizeClaudeRequestSupportsCamelCaseInputSchemaPromptInjection(t *testing.T) {
+	t.Setenv("DS2API_CONFIG_JSON", `{}`)
+	store := config.LoadStore()
+	req := map[string]any{
+		"model": "claude-sonnet-4-5",
+		"messages": []any{
+			map[string]any{"role": "user", "content": "hello"},
+		},
+		"tools": []any{
+			map[string]any{
+				"name":        "todowrite",
+				"description": "Write todos",
+				"inputSchema": map[string]any{"type": "object", "properties": map[string]any{"todos": map[string]any{"type": "array"}}},
+			},
+		},
+	}
+	norm, err := normalizeClaudeRequest(store, req)
+	if err != nil {
+		t.Fatalf("normalize failed: %v", err)
+	}
+	if !containsStr(norm.Standard.FinalPrompt, `"type":"array"`) {
+		t.Fatalf("expected inputSchema to be injected into prompt, got=%q", norm.Standard.FinalPrompt)
+	}
+}
+
 func TestNormalizeClaudeRequestInjectsToolsIntoExistingSystemMessage(t *testing.T) {
 	t.Setenv("DS2API_CONFIG_JSON", `{}`)
 	store := config.LoadStore()
--- a/internal/httpapi/claude/stream_runtime_core.go
+++ b/internal/httpapi/claude/stream_runtime_core.go
@@ -15,9 +15,11 @@ type claudeStreamRuntime struct {
 	rc       *http.ResponseController
 	canFlush bool

-	model     string
-	toolNames []string
-	messages  []any
+	model           string
+	toolNames       []string
+	messages        []any
+	toolsRaw        any
+	promptTokenText string

 	thinkingEnabled       bool
 	searchEnabled         bool
@@ -47,6 +49,8 @@ func newClaudeStreamRuntime(
 	searchEnabled bool,
 	stripReferenceMarkers bool,
 	toolNames []string,
+	toolsRaw any,
+	promptTokenText string,
 ) *claudeStreamRuntime {
 	return &claudeStreamRuntime{
 		w:                     w,
@@ -59,6 +63,8 @@ func newClaudeStreamRuntime(
 		bufferToolContent:     len(toolNames) > 0,
 		stripReferenceMarkers: stripReferenceMarkers,
 		toolNames:             toolNames,
+		toolsRaw:              toolsRaw,
+		promptTokenText:       promptTokenText,
 		messageID:             fmt.Sprintf("msg_%d", time.Now().UnixNano()),
 		thinkingBlockIndex:    -1,
 		textBlockIndex:        -1,
--- a/internal/httpapi/claude/stream_runtime_emit.go
+++ b/internal/httpapi/claude/stream_runtime_emit.go
@@ -42,7 +42,10 @@ func (s *claudeStreamRuntime) sendPing() {
 }

 func (s *claudeStreamRuntime) sendMessageStart() {
-	inputTokens := util.EstimateTokens(fmt.Sprintf("%v", s.messages))
+	inputTokens := countClaudeInputTokensFromText(s.promptTokenText, s.model)
+	if inputTokens == 0 {
+		inputTokens = util.CountPromptTokens(fmt.Sprintf("%v", s.messages), s.model)
+	}
 	s.send("message_start", map[string]any{
 		"type": "message_start",
 		"message": map[string]any{
--- a/internal/httpapi/claude/stream_runtime_finalize.go
+++ b/internal/httpapi/claude/stream_runtime_finalize.go
@@ -52,6 +52,7 @@ func (s *claudeStreamRuntime) finalize(stopReason string) {
 			detected = toolcall.ParseStandaloneToolCalls(finalThinking, s.toolNames)
 		}
 		if len(detected) > 0 {
+			detected = toolcall.NormalizeParsedToolCallsForSchemas(detected, s.toolsRaw)
 			stopReason = "tool_use"
 			for i, tc := range detected {
 				idx := s.nextBlockIndex + i
@@ -108,7 +109,7 @@ func (s *claudeStreamRuntime) finalize(stopReason string) {
 		}
 	}

-	outputTokens := util.EstimateTokens(finalThinking) + util.EstimateTokens(finalText)
+	outputTokens := util.CountOutputTokens(finalThinking, s.model) + util.CountOutputTokens(finalText, s.model)
 	s.send("message_delta", map[string]any{
 		"type": "message_delta",
 		"delta": map[string]any{
--- a/internal/httpapi/claude/token_count.go
+++ b/internal/httpapi/claude/token_count.go
@@ -0,0 +1,23 @@
+package claude
+
+import (
+	"strings"
+
+	"ds2api/internal/promptcompat"
+	"ds2api/internal/util"
+)
+
+// countClaudeInputTokens counts input tokens for a normalized Claude request.
+// It prefers PromptTokenText and falls back to FinalPrompt when that is blank.
+func countClaudeInputTokens(stdReq promptcompat.StandardRequest) int {
+	promptText := stdReq.PromptTokenText
+	if strings.TrimSpace(promptText) == "" {
+		promptText = stdReq.FinalPrompt
+	}
+	return countClaudeInputTokensFromText(promptText, stdReq.ResolvedModel)
+}
+
+// countClaudeInputTokensFromText counts the prompt tokens in promptText for model.
+func countClaudeInputTokensFromText(promptText, model string) int {
+	return util.CountPromptTokens(promptText, model)
+}
--- a/internal/httpapi/gemini/convert_request.go
+++ b/internal/httpapi/gemini/convert_request.go
@@ -36,16 +36,17 @@ func normalizeGeminiRequest(store ConfigReader, routeModel string, req map[strin
 	passThrough := collectGeminiPassThrough(req)

 	return promptcompat.StandardRequest{
-		Surface:        "google_gemini",
-		RequestedModel: requestedModel,
-		ResolvedModel:  resolvedModel,
-		ResponseModel:  requestedModel,
-		Messages:       messagesRaw,
-		FinalPrompt:    finalPrompt,
-		ToolNames:      toolNames,
-		Stream:         stream,
-		Thinking:       thinkingEnabled,
-		Search:         searchEnabled,
-		PassThrough:    passThrough,
+		Surface:         "google_gemini",
+		RequestedModel:  requestedModel,
+		ResolvedModel:   resolvedModel,
+		ResponseModel:   requestedModel,
+		Messages:        messagesRaw,
+		PromptTokenText: finalPrompt,
+		FinalPrompt:     finalPrompt,
+		ToolNames:       toolNames,
+		Stream:          stream,
+		Thinking:        thinkingEnabled,
+		Search:          searchEnabled,
+		PassThrough:     passThrough,
 	}, nil
 }
--- a/internal/httpapi/gemini/handler_generate.go
+++ b/internal/httpapi/gemini/handler_generate.go
@@ -2,8 +2,8 @@ package gemini

 import (
 	"bytes"
-	"ds2api/internal/toolcall"
 	"encoding/json"
+	"errors"
 	"io"
 	"net/http"
 	"net/http/httptest"
@@ -11,7 +11,9 @@ import (

 	"github.com/go-chi/chi/v5"

+	"ds2api/internal/httpapi/requestbody"
 	"ds2api/internal/sse"
+	"ds2api/internal/toolcall"
 	"ds2api/internal/translatorcliproxy"
 	"ds2api/internal/util"

@@ -32,7 +34,11 @@ func (h *Handler) handleGenerateContent(w http.ResponseWriter, r *http.Request,
 func (h *Handler) proxyViaOpenAI(w http.ResponseWriter, r *http.Request, stream bool) bool {
 	raw, err := io.ReadAll(r.Body)
 	if err != nil {
-		writeGeminiError(w, http.StatusBadRequest, "invalid body")
+		if errors.Is(err, requestbody.ErrInvalidUTF8Body) {
+			writeGeminiError(w, http.StatusBadRequest, "invalid json")
+		} else {
+			writeGeminiError(w, http.StatusBadRequest, "invalid body")
+		}
 		return true
 	}
 	routeModel := strings.TrimSpace(chi.URLParam(r, "model"))
@@ -227,7 +233,7 @@ func (h *Handler) handleNonStreamGenerateContent(w http.ResponseWriter, resp *ht
 //nolint:unused // retained for native Gemini non-stream handling path.
 func buildGeminiGenerateContentResponse(model, finalPrompt, finalThinking, finalText string, toolNames []string) map[string]any {
 	parts := buildGeminiPartsFromFinal(finalText, finalThinking, toolNames)
-	usage := buildGeminiUsage(finalPrompt, finalThinking, finalText)
+	usage := buildGeminiUsage(model, finalPrompt, finalThinking, finalText)
 	return map[string]any{
 		"candidates": []map[string]any{
 			{
@@ -245,10 +251,10 @@ func buildGeminiGenerateContentResponse(model, finalPrompt, finalThinking, final
 }

 //nolint:unused // retained for native Gemini non-stream handling path.
-func buildGeminiUsage(finalPrompt, finalThinking, finalText string) map[string]any {
-	promptTokens := util.EstimateTokens(finalPrompt)
-	reasoningTokens := util.EstimateTokens(finalThinking)
-	completionTokens := util.EstimateTokens(finalText)
+func buildGeminiUsage(model, finalPrompt, finalThinking, finalText string) map[string]any {
+	promptTokens := util.CountPromptTokens(finalPrompt, model)
+	reasoningTokens := util.CountOutputTokens(finalThinking, model)
+	completionTokens := util.CountOutputTokens(finalText, model)
 	return map[string]any{
 		"promptTokenCount":     promptTokens,
 		"candidatesTokenCount": reasoningTokens + completionTokens,
--- a/internal/httpapi/gemini/handler_stream_runtime.go
+++ b/internal/httpapi/gemini/handler_stream_runtime.go
@@ -194,6 +194,6 @@ func (s *geminiStreamRuntime) finalize() {
 			},
 		},
 		"modelVersion":  s.model,
-		"usageMetadata": buildGeminiUsage(s.finalPrompt, finalThinking, finalText),
+		"usageMetadata": buildGeminiUsage(s.model, s.finalPrompt, finalThinking, finalText),
 	})
 }
--- a/internal/httpapi/openai/chat/chat_history_test.go
+++ b/internal/httpapi/openai/chat/chat_history_test.go
@@ -126,6 +126,7 @@ func TestStartChatHistoryRecoversFromTransientWriteFailure(t *testing.T) {
 	session := startChatHistory(historyStore, req, a, stdReq)
 	if session == nil {
 		t.Fatalf("expected session even when initial persistence fails")
+		return
 	}
 	if session.disabled {
 		t.Fatalf("expected session to remain active after transient start failure")
@@ -194,7 +195,7 @@ func TestHandleStreamContextCancelledMarksHistoryStopped(t *testing.T) {
 	rec := httptest.NewRecorder()
 	resp := makeOpenAISSEHTTPResponse(`data: {"p":"response/content","v":"hello"}`, `data: [DONE]`)

-	h.handleStream(rec, req, resp, "cid-stop", "deepseek-v4-flash", "prompt", false, false, nil, session)
+	h.handleStream(rec, req, resp, "cid-stop", "deepseek-v4-flash", "prompt", 0, false, false, nil, nil, session)

 	snapshot, err := historyStore.Snapshot()
 	if err != nil {
@@ -307,19 +308,19 @@ func TestChatCompletionsCurrentInputFilePersistsNeutralPrompt(t *testing.T) {
 	if err != nil {
 		t.Fatalf("expected detail item, got %v", err)
 	}
-	if full.HistoryText != "" {
-		t.Fatalf("expected current input file flow to leave history text empty, got %q", full.HistoryText)
-	}
 	if len(ds.uploadCalls) != 1 {
 		t.Fatalf("expected current input upload to happen, got %d", len(ds.uploadCalls))
 	}
-	if ds.uploadCalls[0].Filename != "IGNORE.txt" {
-		t.Fatalf("expected IGNORE.txt upload, got %q", ds.uploadCalls[0].Filename)
+	if ds.uploadCalls[0].Filename != "DS2API_HISTORY.txt" {
+		t.Fatalf("expected DS2API_HISTORY.txt upload, got %q", ds.uploadCalls[0].Filename)
+	}
+	if full.HistoryText != string(ds.uploadCalls[0].Data) {
+		t.Fatalf("expected uploaded current input file to be persisted in history text")
 	}
 	if len(full.Messages) != 1 {
-		t.Fatalf("expected compacted-context prompt to be the only persisted message, got %#v", full.Messages)
+		t.Fatalf("expected continuation prompt to be the only persisted message, got %#v", full.Messages)
 	}
-	if !strings.Contains(full.Messages[0].Content, promptcompat.BuildOpenAICurrentInputContextPrompt()) {
-		t.Fatalf("expected compacted-context prompt to be persisted, got %#v", full.Messages[0])
+	if !strings.Contains(full.Messages[0].Content, "Continue from the latest state in the attached DS2API_HISTORY.txt context.") {
+		t.Fatalf("expected continuation prompt to be persisted, got %#v", full.Messages[0])
 	}
 }
--- a/internal/httpapi/openai/chat/chat_stream_runtime.go
+++ b/internal/httpapi/openai/chat/chat_stream_runtime.go
@@ -16,11 +16,13 @@ type chatStreamRuntime struct {
 	rc       *http.ResponseController
 	canFlush bool

-	completionID string
-	created      int64
-	model        string
-	finalPrompt  string
-	toolNames    []string
+	completionID  string
+	created       int64
+	model         string
+	finalPrompt   string
+	refFileTokens int
+	toolNames     []string
+	toolsRaw      any

 	thinkingEnabled       bool
 	searchEnabled         bool
@@ -35,8 +37,10 @@ type chatStreamRuntime struct {
 	toolSieve             toolstream.State
 	streamToolCallIDs     map[int]string
 	streamToolNames       map[int]string
+	rawThinking           strings.Builder
 	thinking              strings.Builder
 	toolDetectionThinking strings.Builder
+	rawText               strings.Builder
 	text                  strings.Builder
 	responseMessageID     int

@@ -49,6 +53,32 @@ type chatStreamRuntime struct {
 	finalErrorCode    string
 }

+type chatDeltaBatch struct {
+	runtime *chatStreamRuntime
+	field   string
+	text    strings.Builder
+}
+
+func (b *chatDeltaBatch) append(field, text string) {
+	if text == "" {
+		return
+	}
+	if b.field != "" && b.field != field {
+		b.flush()
+	}
+	b.field = field
+	b.text.WriteString(text)
+}
+
+func (b *chatDeltaBatch) flush() {
+	if b.field == "" || b.text.Len() == 0 {
+		return
+	}
+	b.runtime.sendDelta(map[string]any{b.field: b.text.String()})
+	b.field = ""
+	b.text.Reset()
+}
+
 func newChatStreamRuntime(
 	w http.ResponseWriter,
 	rc *http.ResponseController,
@@ -61,6 +91,7 @@ func newChatStreamRuntime(
 	searchEnabled bool,
 	stripReferenceMarkers bool,
 	toolNames []string,
+	toolsRaw any,
 	bufferToolContent bool,
 	emitEarlyToolDeltas bool,
 ) *chatStreamRuntime {
@@ -73,6 +104,7 @@ func newChatStreamRuntime(
 		model:                 model,
 		finalPrompt:           finalPrompt,
 		toolNames:             toolNames,
+		toolsRaw:              toolsRaw,
 		thinkingEnabled:       thinkingEnabled,
 		searchEnabled:         searchEnabled,
 		stripReferenceMarkers: stripReferenceMarkers,
@@ -101,6 +133,23 @@ func (s *chatStreamRuntime) sendChunk(v any) {
 	}
 }

+func (s *chatStreamRuntime) sendDelta(delta map[string]any) {
+	if len(delta) == 0 {
+		return
+	}
+	if !s.firstChunkSent {
+		delta["role"] = "assistant"
+		s.firstChunkSent = true
+	}
+	s.sendChunk(openaifmt.BuildChatStreamChunk(
+		s.completionID,
+		s.created,
+		s.model,
+		[]map[string]any{openaifmt.BuildChatStreamDeltaChoice(0, delta)},
+		nil,
+	))
+}
+
 func (s *chatStreamRuntime) sendDone() {
 	_, _ = s.w.Write([]byte("data: [DONE]\n\n"))
 	if s.canFlush {
@@ -124,6 +173,15 @@ func (s *chatStreamRuntime) sendFailedChunk(status int, message, code string) {
 	s.sendDone()
 }

+func (s *chatStreamRuntime) markContextCancelled() {
+	s.finalErrorStatus = 499 // non-standard "client closed request" (nginx convention); net/http defines no constant for it
+	s.finalErrorMessage = "request context cancelled"
+	s.finalErrorCode = string(streamengine.StopReasonContextCancelled)
+	s.finalThinking = s.thinking.String()
+	s.finalText = cleanVisibleOutput(s.text.String(), s.stripReferenceMarkers)
+	s.finalFinishReason = string(streamengine.StopReasonContextCancelled)
+}
+
 func (s *chatStreamRuntime) resetStreamToolCallState() {
 	s.streamToolCallIDs = map[int]string{}
 	s.streamToolNames = map[int]string{}
@@ -138,69 +196,37 @@ func (s *chatStreamRuntime) finalize(finishReason string, deferEmptyOutput bool)
 	finalText := cleanVisibleOutput(s.text.String(), s.stripReferenceMarkers)
 	s.finalThinking = finalThinking
 	s.finalText = finalText
-	detected := detectAssistantToolCalls(finalText, finalThinking, finalToolDetectionThinking, s.toolNames)
+	detected := detectAssistantToolCalls(s.rawText.String(), finalText, s.rawThinking.String(), finalToolDetectionThinking, s.toolNames)
 	if len(detected.Calls) > 0 && !s.toolCallsDoneEmitted {
 		finishReason = "tool_calls"
-		delta := map[string]any{
-			"tool_calls": formatFinalStreamToolCallsWithStableIDs(detected.Calls, s.streamToolCallIDs),
-		}
-		if !s.firstChunkSent {
-			delta["role"] = "assistant"
-			s.firstChunkSent = true
-		}
-		s.sendChunk(openaifmt.BuildChatStreamChunk(
-			s.completionID,
-			s.created,
-			s.model,
-			[]map[string]any{openaifmt.BuildChatStreamDeltaChoice(0, delta)},
-			nil,
-		))
+		s.sendDelta(map[string]any{
+			"tool_calls": formatFinalStreamToolCallsWithStableIDs(detected.Calls, s.streamToolCallIDs, s.toolsRaw),
+		})
 		s.toolCallsEmitted = true
 		s.toolCallsDoneEmitted = true
 	} else if s.bufferToolContent {
+		batch := chatDeltaBatch{runtime: s}
 		for _, evt := range toolstream.Flush(&s.toolSieve, s.toolNames) {
 			if len(evt.ToolCalls) > 0 {
+				batch.flush()
 				finishReason = "tool_calls"
 				s.toolCallsEmitted = true
 				s.toolCallsDoneEmitted = true
-				tcDelta := map[string]any{
-					"tool_calls": formatFinalStreamToolCallsWithStableIDs(evt.ToolCalls, s.streamToolCallIDs),
-				}
-				if !s.firstChunkSent {
-					tcDelta["role"] = "assistant"
-					s.firstChunkSent = true
-				}
-				s.sendChunk(openaifmt.BuildChatStreamChunk(
-					s.completionID,
-					s.created,
-					s.model,
-					[]map[string]any{openaifmt.BuildChatStreamDeltaChoice(0, tcDelta)},
-					nil,
-				))
+				s.sendDelta(map[string]any{
+					"tool_calls": formatFinalStreamToolCallsWithStableIDs(evt.ToolCalls, s.streamToolCallIDs, s.toolsRaw),
+				})
 				s.resetStreamToolCallState()
 			}
 			if evt.Content == "" {
 				continue
 			}
 			cleaned := cleanVisibleOutput(evt.Content, s.stripReferenceMarkers)
-			if cleaned == "" {
+			if cleaned == "" || (s.searchEnabled && sse.IsCitation(cleaned)) {
 				continue
 			}
-			delta := map[string]any{
-				"content": cleaned,
-			}
-			if !s.firstChunkSent {
-				delta["role"] = "assistant"
-				s.firstChunkSent = true
-			}
-			s.sendChunk(openaifmt.BuildChatStreamChunk(
-				s.completionID,
-				s.created,
-				s.model,
-				[]map[string]any{openaifmt.BuildChatStreamDeltaChoice(0, delta)},
-				nil,
-			))
+			batch.append("content", cleaned)
 		}
+		batch.flush()
 	}

 	if len(detected.Calls) > 0 || s.toolCallsEmitted {
@@ -217,7 +243,7 @@ func (s *chatStreamRuntime) finalize(finishReason string, deferEmptyOutput bool)
 		s.sendFailedChunk(status, message, code)
 		return true
 	}
-	usage := openaifmt.BuildChatUsage(s.finalPrompt, finalThinking, finalText)
+	usage := openaifmt.BuildChatUsageForModel(s.model, s.finalPrompt, finalThinking, finalText, s.refFileTokens)
 	s.finalFinishReason = finishReason
 	s.finalUsage = usage
 	s.sendChunk(openaifmt.BuildChatStreamChunk(
@@ -251,8 +277,8 @@ func (s *chatStreamRuntime) onParsed(parsed sse.LineResult) streamengine.ParsedD
 		return streamengine.ParsedDecision{Stop: true, StopReason: streamengine.StopReasonHandlerRequested}
 	}

-	newChoices := make([]map[string]any, 0, len(parsed.Parts))
 	contentSeen := false
+	batch := chatDeltaBatch{runtime: s}
 	for _, p := range parsed.ToolDetectionThinkingParts {
 		trimmed := sse.TrimContinuationOverlap(s.toolDetectionThinking.String(), p.Text)
 		if trimmed != "" {
@@ -260,38 +286,46 @@ func (s *chatStreamRuntime) onParsed(parsed sse.LineResult) streamengine.ParsedD
 		}
 	}
 	for _, p := range parsed.Parts {
-		cleanedText := cleanVisibleOutput(p.Text, s.stripReferenceMarkers)
-		if s.searchEnabled && sse.IsCitation(cleanedText) {
-			continue
-		}
-		if cleanedText == "" {
-			continue
-		}
-		contentSeen = true
-		delta := map[string]any{}
-		if !s.firstChunkSent {
-			delta["role"] = "assistant"
-			s.firstChunkSent = true
-		}
 		if p.Type == "thinking" {
+			rawTrimmed := sse.TrimContinuationOverlap(s.rawThinking.String(), p.Text)
+			if rawTrimmed != "" {
+				s.rawThinking.WriteString(rawTrimmed)
+				contentSeen = true
+			}
 			if s.thinkingEnabled {
+				cleanedText := cleanVisibleOutput(rawTrimmed, s.stripReferenceMarkers)
+				if cleanedText == "" {
+					continue
+				}
 				trimmed := sse.TrimContinuationOverlap(s.thinking.String(), cleanedText)
 				if trimmed == "" {
 					continue
 				}
 				s.thinking.WriteString(trimmed)
-				delta["reasoning_content"] = trimmed
+				batch.append("reasoning_content", trimmed)
 			}
 		} else {
-			trimmed := sse.TrimContinuationOverlap(s.text.String(), cleanedText)
-			if trimmed == "" {
+			rawTrimmed := sse.TrimContinuationOverlap(s.rawText.String(), p.Text)
+			if rawTrimmed == "" {
 				continue
 			}
-			s.text.WriteString(trimmed)
+			s.rawText.WriteString(rawTrimmed)
+			contentSeen = true
+			cleanedText := cleanVisibleOutput(rawTrimmed, s.stripReferenceMarkers)
+			if s.searchEnabled && sse.IsCitation(cleanedText) {
+				continue
+			}
+			trimmed := sse.TrimContinuationOverlap(s.text.String(), cleanedText)
+			if trimmed != "" {
+				s.text.WriteString(trimmed)
+			}
 			if !s.bufferToolContent {
-				delta["content"] = trimmed
+				if trimmed == "" {
+					continue
+				}
+				batch.append("content", trimmed)
 			} else {
-				events := toolstream.ProcessChunk(&s.toolSieve, trimmed, s.toolNames)
+				events := toolstream.ProcessChunk(&s.toolSieve, rawTrimmed, s.toolNames)
 				for _, evt := range events {
 					if len(evt.ToolCallDeltas) > 0 {
 						if !s.emitEarlyToolDeltas {
@@ -305,55 +339,36 @@ func (s *chatStreamRuntime) onParsed(parsed sse.LineResult) streamengine.ParsedD
 						if len(formatted) == 0 {
 							continue
 						}
+						batch.flush()
 						tcDelta := map[string]any{
 							"tool_calls": formatted,
 						}
 						s.toolCallsEmitted = true
-						if !s.firstChunkSent {
-							tcDelta["role"] = "assistant"
-							s.firstChunkSent = true
-						}
-						newChoices = append(newChoices, openaifmt.BuildChatStreamDeltaChoice(0, tcDelta))
+						s.sendDelta(tcDelta)
 						continue
 					}
 					if len(evt.ToolCalls) > 0 {
+						batch.flush()
 						s.toolCallsEmitted = true
 						s.toolCallsDoneEmitted = true
 						tcDelta := map[string]any{
-							"tool_calls": formatFinalStreamToolCallsWithStableIDs(evt.ToolCalls, s.streamToolCallIDs),
+							"tool_calls": formatFinalStreamToolCallsWithStableIDs(evt.ToolCalls, s.streamToolCallIDs, s.toolsRaw),
 						}
-						if !s.firstChunkSent {
-							tcDelta["role"] = "assistant"
-							s.firstChunkSent = true
-						}
-						newChoices = append(newChoices, openaifmt.BuildChatStreamDeltaChoice(0, tcDelta))
+						s.sendDelta(tcDelta)
 						s.resetStreamToolCallState()
 						continue
 					}
 					if evt.Content != "" {
 						cleaned := cleanVisibleOutput(evt.Content, s.stripReferenceMarkers)
-						if cleaned == "" {
+						if cleaned == "" || (s.searchEnabled && sse.IsCitation(cleaned)) {
 							continue
 						}
-						contentDelta := map[string]any{
-							"content": cleaned,
-						}
-						if !s.firstChunkSent {
-							contentDelta["role"] = "assistant"
-							s.firstChunkSent = true
-						}
-						newChoices = append(newChoices, openaifmt.BuildChatStreamDeltaChoice(0, contentDelta))
+						batch.append("content", cleaned)
 					}
 				}
 			}
 		}
-		if len(delta) > 0 {
-			newChoices = append(newChoices, openaifmt.BuildChatStreamDeltaChoice(0, delta))
-		}
-	}
-
-	if len(newChoices) > 0 {
-		s.sendChunk(openaifmt.BuildChatStreamChunk(s.completionID, s.created, s.model, newChoices, nil))
 	}
+	batch.flush()
 	return streamengine.ParsedDecision{ContentSeen: contentSeen}
 }
--- a/internal/httpapi/openai/chat/empty_retry_runtime.go
+++ b/internal/httpapi/openai/chat/empty_retry_runtime.go
@@ -16,6 +16,8 @@ import (
 )

 type chatNonStreamResult struct {
+	rawThinking           string
+	rawText               string
 	thinking              string
 	toolDetectionThinking string
 	text                  string
@@ -26,27 +28,31 @@ type chatNonStreamResult struct {
 	responseMessageID     int
 }

-func (h *Handler) handleNonStreamWithRetry(w http.ResponseWriter, ctx context.Context, a *auth.RequestAuth, resp *http.Response, payload map[string]any, pow, completionID, model, finalPrompt string, thinkingEnabled, searchEnabled bool, toolNames []string, historySession *chatHistorySession) {
+func (h *Handler) handleNonStreamWithRetry(w http.ResponseWriter, ctx context.Context, a *auth.RequestAuth, resp *http.Response, payload map[string]any, pow, completionID, model, finalPrompt string, refFileTokens int, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, historySession *chatHistorySession) {
 	attempts := 0
 	currentResp := resp
 	usagePrompt := finalPrompt
 	accumulatedThinking := ""
+	accumulatedRawThinking := ""
 	accumulatedToolDetectionThinking := ""
 	for {
-		result, ok := h.collectChatNonStreamAttempt(w, currentResp, completionID, model, usagePrompt, thinkingEnabled, searchEnabled, toolNames)
+		result, ok := h.collectChatNonStreamAttempt(w, currentResp, completionID, model, usagePrompt, thinkingEnabled, searchEnabled, toolNames, toolsRaw)
 		if !ok {
 			return
 		}
 		accumulatedThinking += sse.TrimContinuationOverlap(accumulatedThinking, result.thinking)
+		accumulatedRawThinking += sse.TrimContinuationOverlap(accumulatedRawThinking, result.rawThinking)
 		accumulatedToolDetectionThinking += sse.TrimContinuationOverlap(accumulatedToolDetectionThinking, result.toolDetectionThinking)
 		result.thinking = accumulatedThinking
+		result.rawThinking = accumulatedRawThinking
 		result.toolDetectionThinking = accumulatedToolDetectionThinking
-		detected := detectAssistantToolCalls(result.text, result.thinking, result.toolDetectionThinking, toolNames)
+		detected := detectAssistantToolCalls(result.rawText, result.text, result.rawThinking, result.toolDetectionThinking, toolNames)
 		result.detectedCalls = len(detected.Calls)
-		result.body = openaifmt.BuildChatCompletionWithToolCalls(completionID, model, usagePrompt, result.thinking, result.text, detected.Calls)
+		result.body = openaifmt.BuildChatCompletionWithToolCalls(completionID, model, usagePrompt, result.thinking, result.text, detected.Calls, toolsRaw)
+		addRefFileTokensToUsage(result.body, refFileTokens)
 		result.finishReason = chatFinishReason(result.body)
 		if !shouldRetryChatNonStream(result, attempts) {
-			h.finishChatNonStreamResult(w, result, attempts, usagePrompt, historySession)
+			h.finishChatNonStreamResult(w, result, attempts, usagePrompt, refFileTokens, historySession)
 			return
 		}

@@ -67,12 +73,12 @@ func (h *Handler) handleNonStreamWithRetry(w http.ResponseWriter, ctx context.Co
 			config.Logger.Warn("[openai_empty_retry] retry request failed", "surface", "chat.completions", "stream", false, "retry_attempt", attempts, "error", err)
 			return
 		}
-		usagePrompt = usagePromptWithEmptyOutputRetry(finalPrompt, attempts)
+		usagePrompt = usagePromptWithEmptyOutputRetry(usagePrompt, attempts)
 		currentResp = nextResp
 	}
 }

-func (h *Handler) collectChatNonStreamAttempt(w http.ResponseWriter, resp *http.Response, completionID, model, usagePrompt string, thinkingEnabled, searchEnabled bool, toolNames []string) (chatNonStreamResult, bool) {
+func (h *Handler) collectChatNonStreamAttempt(w http.ResponseWriter, resp *http.Response, completionID, model, usagePrompt string, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any) (chatNonStreamResult, bool) {
 	if resp.StatusCode != http.StatusOK {
 		defer func() { _ = resp.Body.Close() }()
 		body, _ := io.ReadAll(resp.Body)
@@ -82,16 +88,17 @@ func (h *Handler) collectChatNonStreamAttempt(w http.ResponseWriter, resp *http.
 	result := sse.CollectStream(resp, thinkingEnabled, true)
 	stripReferenceMarkers := h.compatStripReferenceMarkers()
 	finalThinking := cleanVisibleOutput(result.Thinking, stripReferenceMarkers)
-	finalToolDetectionThinking := cleanVisibleOutput(result.ToolDetectionThinking, stripReferenceMarkers)
 	finalText := cleanVisibleOutput(result.Text, stripReferenceMarkers)
 	if searchEnabled {
 		finalText = replaceCitationMarkersWithLinks(finalText, result.CitationLinks)
 	}
-	detected := detectAssistantToolCalls(finalText, finalThinking, finalToolDetectionThinking, toolNames)
-	respBody := openaifmt.BuildChatCompletionWithToolCalls(completionID, model, usagePrompt, finalThinking, finalText, detected.Calls)
+	detected := detectAssistantToolCalls(result.Text, finalText, result.Thinking, result.ToolDetectionThinking, toolNames)
+	respBody := openaifmt.BuildChatCompletionWithToolCalls(completionID, model, usagePrompt, finalThinking, finalText, detected.Calls, toolsRaw)
 	return chatNonStreamResult{
+		rawThinking:           result.Thinking,
+		rawText:               result.Text,
 		thinking:              finalThinking,
-		toolDetectionThinking: finalToolDetectionThinking,
+		toolDetectionThinking: result.ToolDetectionThinking,
 		text:                  finalText,
 		contentFilter:         result.ContentFilter,
 		detectedCalls:         len(detected.Calls),
@@ -101,7 +108,7 @@ func (h *Handler) collectChatNonStreamAttempt(w http.ResponseWriter, resp *http.
 	}, true
 }

-func (h *Handler) finishChatNonStreamResult(w http.ResponseWriter, result chatNonStreamResult, attempts int, usagePrompt string, historySession *chatHistorySession) {
+func (h *Handler) finishChatNonStreamResult(w http.ResponseWriter, result chatNonStreamResult, attempts int, usagePrompt string, refFileTokens int, historySession *chatHistorySession) {
 	if result.detectedCalls == 0 && shouldWriteUpstreamEmptyOutputError(result.text) {
 		status, message, code := upstreamEmptyOutputDetail(result.contentFilter, result.text, result.thinking)
 		if historySession != nil {
@@ -112,7 +119,7 @@ func (h *Handler) finishChatNonStreamResult(w http.ResponseWriter, result chatNo
 		return
 	}
 	if historySession != nil {
-		historySession.success(http.StatusOK, result.thinking, result.text, result.finishReason, openaifmt.BuildChatUsage(usagePrompt, result.thinking, result.text))
+		historySession.success(http.StatusOK, result.thinking, result.text, result.finishReason, openaifmt.BuildChatUsageForModel("", usagePrompt, result.thinking, result.text, refFileTokens))
 	}
 	writeJSON(w, http.StatusOK, result.body)
 	source := "first_attempt"
@@ -139,8 +146,8 @@ func shouldRetryChatNonStream(result chatNonStreamResult, attempts int) bool {
 		strings.TrimSpace(result.text) == ""
 }

-func (h *Handler) handleStreamWithRetry(w http.ResponseWriter, r *http.Request, a *auth.RequestAuth, resp *http.Response, payload map[string]any, pow, completionID, model, finalPrompt string, thinkingEnabled, searchEnabled bool, toolNames []string, historySession *chatHistorySession) {
-	streamRuntime, initialType, ok := h.prepareChatStreamRuntime(w, resp, completionID, model, finalPrompt, thinkingEnabled, searchEnabled, toolNames, historySession)
+func (h *Handler) handleStreamWithRetry(w http.ResponseWriter, r *http.Request, a *auth.RequestAuth, resp *http.Response, payload map[string]any, pow, completionID, model, finalPrompt string, refFileTokens int, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, historySession *chatHistorySession) {
+	streamRuntime, initialType, ok := h.prepareChatStreamRuntime(w, resp, completionID, model, finalPrompt, refFileTokens, thinkingEnabled, searchEnabled, toolNames, toolsRaw, historySession)
 	if !ok {
 		return
 	}
@@ -182,7 +189,7 @@ func (h *Handler) handleStreamWithRetry(w http.ResponseWriter, r *http.Request,
 	}
 }

-func (h *Handler) prepareChatStreamRuntime(w http.ResponseWriter, resp *http.Response, completionID, model, finalPrompt string, thinkingEnabled, searchEnabled bool, toolNames []string, historySession *chatHistorySession) (*chatStreamRuntime, string, bool) {
+func (h *Handler) prepareChatStreamRuntime(w http.ResponseWriter, resp *http.Response, completionID, model, finalPrompt string, refFileTokens int, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, historySession *chatHistorySession) (*chatStreamRuntime, string, bool) {
 	if resp.StatusCode != http.StatusOK {
 		defer func() { _ = resp.Body.Close() }()
 		body, _ := io.ReadAll(resp.Body)
@@ -207,9 +214,10 @@ func (h *Handler) prepareChatStreamRuntime(w http.ResponseWriter, resp *http.Res
 	}
 	streamRuntime := newChatStreamRuntime(
 		w, rc, canFlush, completionID, time.Now().Unix(), model, finalPrompt,
-		thinkingEnabled, searchEnabled, h.compatStripReferenceMarkers(), toolNames,
+		thinkingEnabled, searchEnabled, h.compatStripReferenceMarkers(), toolNames, toolsRaw,
 		len(toolNames) > 0, h.toolcallFeatureMatchEnabled() && h.toolcallEarlyEmitHighConfidence(),
 	)
+	streamRuntime.refFileTokens = refFileTokens
 	return streamRuntime, initialType, true
 }

@@ -239,11 +247,15 @@ func (h *Handler) consumeChatStreamAttempt(r *http.Request, resp *http.Response,
 			}
 		},
 		OnContextDone: func() {
+			streamRuntime.markContextCancelled()
 			if historySession != nil {
 				historySession.stopped(streamRuntime.thinking.String(), streamRuntime.text.String(), string(streamengine.StopReasonContextCancelled))
 			}
 		},
 	})
+	if streamRuntime.finalErrorCode == string(streamengine.StopReasonContextCancelled) {
+		return true, false
+	}
 	terminalWritten := streamRuntime.finalize(finalReason, allowDeferEmpty && finalReason != "content_filter")
 	if terminalWritten {
 		recordChatStreamHistory(streamRuntime, historySession)
@@ -275,6 +287,10 @@ func logChatStreamTerminal(streamRuntime *chatStreamRuntime, attempts int) {
 	if attempts > 0 {
 		source = "synthetic_retry"
 	}
+	if streamRuntime.finalErrorCode == string(streamengine.StopReasonContextCancelled) {
+		config.Logger.Info("[openai_empty_retry] terminal cancelled", "surface", "chat.completions", "stream", true, "retry_attempts", attempts, "error_code", streamRuntime.finalErrorCode)
+		return
+	}
 	if streamRuntime.finalErrorMessage != "" {
 		config.Logger.Info("[openai_empty_retry] terminal empty output", "surface", "chat.completions", "stream", true, "retry_attempts", attempts, "success_source", "none", "error_code", streamRuntime.finalErrorCode)
 		return
--- a/internal/httpapi/openai/chat/empty_retry_runtime_test.go
+++ b/internal/httpapi/openai/chat/empty_retry_runtime_test.go
@@ -0,0 +1,85 @@
+package chat
+
+import (
+	"context"
+	"net/http"
+	"net/http/httptest"
+	"testing"
+	"time"
+
+	"ds2api/internal/chathistory"
+	"ds2api/internal/stream"
+)
+
+func TestConsumeChatStreamAttemptMarksContextCancelledState(t *testing.T) {
+	historyStore := newTestChatHistoryStore(t)
+	entry, err := historyStore.Start(chathistory.StartParams{
+		CallerID:  "caller:test",
+		Model:     "deepseek-v4-flash",
+		Stream:    true,
+		UserInput: "hello",
+	})
+	if err != nil {
+		t.Fatalf("start history failed: %v", err)
+	}
+	session := &chatHistorySession{
+		store:       historyStore,
+		entryID:     entry.ID,
+		startedAt:   time.Now(),
+		lastPersist: time.Now(),
+		finalPrompt: "prompt",
+	}
+
+	ctx, cancel := context.WithCancel(context.Background())
+	cancel()
+
+	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil).WithContext(ctx)
+	rec := httptest.NewRecorder()
+	streamRuntime := newChatStreamRuntime(
+		rec,
+		http.NewResponseController(rec),
+		true,
+		"cid-cancelled",
+		time.Now().Unix(),
+		"deepseek-v4-flash",
+		"prompt",
+		false,
+		false,
+		true,
+		nil,
+		nil,
+		false,
+		false,
+	)
+	resp := makeOpenAISSEHTTPResponse(
+		`data: {"p":"response/content","v":"hello"}`,
+		`data: [DONE]`,
+	)
+
+	h := &Handler{}
+	terminalWritten, retryable := h.consumeChatStreamAttempt(req, resp, streamRuntime, "text", false, session, true)
+	if !terminalWritten || retryable {
+		t.Fatalf("expected cancelled attempt to terminate without retry, got terminalWritten=%v retryable=%v", terminalWritten, retryable)
+	}
+	if got, want := streamRuntime.finalErrorCode, string(stream.StopReasonContextCancelled); got != want {
+		t.Fatalf("expected cancelled final error code %q, got %q", want, got)
+	}
+	if streamRuntime.finalErrorMessage == "" {
+		t.Fatalf("expected cancelled final error message to be preserved")
+	}
+
+	snapshot, err := historyStore.Snapshot()
+	if err != nil {
+		t.Fatalf("snapshot failed: %v", err)
+	}
+	if len(snapshot.Items) != 1 {
+		t.Fatalf("expected one history item, got %d", len(snapshot.Items))
+	}
+	full, err := historyStore.Get(snapshot.Items[0].ID)
+	if err != nil {
+		t.Fatalf("get detail failed: %v", err)
+	}
+	if full.Status != "stopped" {
+		t.Fatalf("expected stopped status, got %#v", full)
+	}
+}
--- a/internal/httpapi/openai/chat/handler.go
+++ b/internal/httpapi/openai/chat/handler.go
@@ -144,10 +144,10 @@ func filterIncrementalToolCallDeltasByAllowed(deltas []toolstream.ToolCallDelta,
 	return shared.FilterIncrementalToolCallDeltasByAllowed(deltas, seenNames)
 }

-func formatFinalStreamToolCallsWithStableIDs(calls []toolcall.ParsedToolCall, ids map[int]string) []map[string]any {
-	return shared.FormatFinalStreamToolCallsWithStableIDs(calls, ids)
+func formatFinalStreamToolCallsWithStableIDs(calls []toolcall.ParsedToolCall, ids map[int]string, toolsRaw any) []map[string]any {
+	return shared.FormatFinalStreamToolCallsWithStableIDs(calls, ids, toolsRaw)
 }

-func detectAssistantToolCalls(text, exposedThinking, detectionThinking string, toolNames []string) toolcall.ToolCallParseResult {
-	return shared.DetectAssistantToolCalls(text, exposedThinking, detectionThinking, toolNames)
+func detectAssistantToolCalls(rawText, visibleText, exposedThinking, detectionThinking string, toolNames []string) toolcall.ToolCallParseResult {
+	return shared.DetectAssistantToolCalls(rawText, visibleText, exposedThinking, detectionThinking, toolNames)
 }
--- a/internal/httpapi/openai/chat/handler_chat.go
+++ b/internal/httpapi/openai/chat/handler_chat.go
@@ -108,11 +108,12 @@ func (h *Handler) ChatCompletions(w http.ResponseWriter, r *http.Request) {
 		writeOpenAIError(w, http.StatusInternalServerError, "Failed to get completion.")
 		return
 	}
+	refFileTokens := stdReq.RefFileTokens
 	if stdReq.Stream {
-		h.handleStreamWithRetry(w, r, a, resp, payload, pow, sessionID, stdReq.ResponseModel, stdReq.FinalPrompt, stdReq.Thinking, stdReq.Search, stdReq.ToolNames, historySession)
+		h.handleStreamWithRetry(w, r, a, resp, payload, pow, sessionID, stdReq.ResponseModel, stdReq.PromptTokenText, refFileTokens, stdReq.Thinking, stdReq.Search, stdReq.ToolNames, stdReq.ToolsRaw, historySession)
 		return
 	}
-	h.handleNonStreamWithRetry(w, r.Context(), a, resp, payload, pow, sessionID, stdReq.ResponseModel, stdReq.FinalPrompt, stdReq.Thinking, stdReq.Search, stdReq.ToolNames, historySession)
+	h.handleNonStreamWithRetry(w, r.Context(), a, resp, payload, pow, sessionID, stdReq.ResponseModel, stdReq.PromptTokenText, refFileTokens, stdReq.Thinking, stdReq.Search, stdReq.ToolNames, stdReq.ToolsRaw, historySession)
 }

 func (h *Handler) autoDeleteRemoteSession(ctx context.Context, a *auth.RequestAuth, sessionID string) {
@@ -148,7 +149,7 @@ func (h *Handler) autoDeleteRemoteSession(ctx context.Context, a *auth.RequestAu
 	}
 }

-func (h *Handler) handleNonStream(w http.ResponseWriter, resp *http.Response, completionID, model, finalPrompt string, thinkingEnabled, searchEnabled bool, toolNames []string, historySession *chatHistorySession) {
+func (h *Handler) handleNonStream(w http.ResponseWriter, resp *http.Response, completionID, model, finalPrompt string, refFileTokens int, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, historySession *chatHistorySession) {
 	if resp.StatusCode != http.StatusOK {
 		defer func() { _ = resp.Body.Close() }()
 		body, _ := io.ReadAll(resp.Body)
@@ -162,12 +163,11 @@ func (h *Handler) handleNonStream(w http.ResponseWriter, resp *http.Response, co

 	stripReferenceMarkers := h.compatStripReferenceMarkers()
 	finalThinking := cleanVisibleOutput(result.Thinking, stripReferenceMarkers)
-	finalToolDetectionThinking := cleanVisibleOutput(result.ToolDetectionThinking, stripReferenceMarkers)
 	finalText := cleanVisibleOutput(result.Text, stripReferenceMarkers)
 	if searchEnabled {
 		finalText = replaceCitationMarkersWithLinks(finalText, result.CitationLinks)
 	}
-	detected := detectAssistantToolCalls(finalText, finalThinking, finalToolDetectionThinking, toolNames)
+	detected := detectAssistantToolCalls(result.Text, finalText, result.Thinking, result.ToolDetectionThinking, toolNames)
 	if shouldWriteUpstreamEmptyOutputError(finalText) && len(detected.Calls) == 0 {
 		status, message, code := upstreamEmptyOutputDetail(result.ContentFilter, finalText, finalThinking)
 		if historySession != nil {
@@ -176,7 +176,10 @@ func (h *Handler) handleNonStream(w http.ResponseWriter, resp *http.Response, co
 		writeUpstreamEmptyOutputError(w, finalText, finalThinking, result.ContentFilter)
 		return
 	}
-	respBody := openaifmt.BuildChatCompletionWithToolCalls(completionID, model, finalPrompt, finalThinking, finalText, detected.Calls)
+	respBody := openaifmt.BuildChatCompletionWithToolCalls(completionID, model, finalPrompt, finalThinking, finalText, detected.Calls, toolsRaw)
+	if refFileTokens > 0 {
+		addRefFileTokensToUsage(respBody, refFileTokens)
+	}
 	finishReason := "stop"
 	if choices, ok := respBody["choices"].([]map[string]any); ok && len(choices) > 0 {
 		if fr, _ := choices[0]["finish_reason"].(string); strings.TrimSpace(fr) != "" {
@@ -184,12 +187,12 @@ func (h *Handler) handleNonStream(w http.ResponseWriter, resp *http.Response, co
 		}
 	}
 	if historySession != nil {
-		historySession.success(http.StatusOK, finalThinking, finalText, finishReason, openaifmt.BuildChatUsage(finalPrompt, finalThinking, finalText))
+		historySession.success(http.StatusOK, finalThinking, finalText, finishReason, openaifmt.BuildChatUsageForModel(model, finalPrompt, finalThinking, finalText, refFileTokens))
 	}
 	writeJSON(w, http.StatusOK, respBody)
 }

-func (h *Handler) handleStream(w http.ResponseWriter, r *http.Request, resp *http.Response, completionID, model, finalPrompt string, thinkingEnabled, searchEnabled bool, toolNames []string, historySession *chatHistorySession) {
+func (h *Handler) handleStream(w http.ResponseWriter, r *http.Request, resp *http.Response, completionID, model, finalPrompt string, refFileTokens int, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, historySession *chatHistorySession) {
 	defer func() { _ = resp.Body.Close() }()
 	if resp.StatusCode != http.StatusOK {
 		body, _ := io.ReadAll(resp.Body)
@@ -230,9 +233,11 @@ func (h *Handler) handleStream(w http.ResponseWriter, r *http.Request, resp *htt
 		searchEnabled,
 		stripReferenceMarkers,
 		toolNames,
+		toolsRaw,
 		bufferToolContent,
 		emitEarlyToolDeltas,
 	)
+	streamRuntime.refFileTokens = refFileTokens

 	streamengine.ConsumeSSE(streamengine.ConsumeConfig{
 		Context:             r.Context(),
--- a/internal/httpapi/openai/chat/handler_toolcall_test.go
+++ b/internal/httpapi/openai/chat/handler_toolcall_test.go
@@ -1,6 +1,7 @@
 package chat

 import (
+	"context"
 	"encoding/json"
 	"io"
 	"net/http"
@@ -93,7 +94,7 @@ func TestHandleNonStreamReturns429WhenUpstreamOutputEmpty(t *testing.T) {
 	)
 	rec := httptest.NewRecorder()

-	h.handleNonStream(rec, resp, "cid-empty", "deepseek-v4-flash", "prompt", false, false, nil, nil)
+	h.handleNonStream(rec, resp, "cid-empty", "deepseek-v4-flash", "prompt", 0, false, false, nil, nil, nil)
 	if rec.Code != http.StatusTooManyRequests {
 		t.Fatalf("expected status 429 for empty upstream output, got %d body=%s", rec.Code, rec.Body.String())
 	}
@@ -112,7 +113,7 @@ func TestHandleNonStreamReturnsContentFilterErrorWhenUpstreamFilteredWithoutOutp
 	)
 	rec := httptest.NewRecorder()

-	h.handleNonStream(rec, resp, "cid-empty-filtered", "deepseek-v4-flash", "prompt", false, false, nil, nil)
+	h.handleNonStream(rec, resp, "cid-empty-filtered", "deepseek-v4-flash", "prompt", 0, false, false, nil, nil, nil)
 	if rec.Code != http.StatusBadRequest {
 		t.Fatalf("expected status 400 for filtered upstream output, got %d body=%s", rec.Code, rec.Body.String())
 	}
@@ -131,7 +132,7 @@ func TestHandleNonStreamReturns429WhenUpstreamHasOnlyThinking(t *testing.T) {
 	)
 	rec := httptest.NewRecorder()

-	h.handleNonStream(rec, resp, "cid-thinking-only", "deepseek-v4-pro", "prompt", true, false, nil, nil)
+	h.handleNonStream(rec, resp, "cid-thinking-only", "deepseek-v4-pro", "prompt", 0, true, false, nil, nil, nil)
 	if rec.Code != http.StatusTooManyRequests {
 		t.Fatalf("expected status 429 for thinking-only upstream output, got %d body=%s", rec.Code, rec.Body.String())
 	}
@@ -150,7 +151,7 @@ func TestHandleNonStreamPromotesThinkingToolCallsWhenTextEmpty(t *testing.T) {
 	)
 	rec := httptest.NewRecorder()

-	h.handleNonStream(rec, resp, "cid-thinking-tool", "deepseek-v4-pro", "prompt", true, false, []string{"search"}, nil)
+	h.handleNonStream(rec, resp, "cid-thinking-tool", "deepseek-v4-pro", "prompt", 0, true, false, []string{"search"}, nil, nil)
 	if rec.Code != http.StatusOK {
 		t.Fatalf("expected 200 for thinking tool calls, got %d body=%s", rec.Code, rec.Body.String())
 	}
@@ -181,7 +182,7 @@ func TestHandleNonStreamPromotesHiddenThinkingDSMLToolCallsWhenTextEmpty(t *test
 	)
 	rec := httptest.NewRecorder()

-	h.handleNonStream(rec, resp, "cid-hidden-thinking-tool", "deepseek-v4-pro", "prompt", false, false, []string{"search"}, nil)
+	h.handleNonStream(rec, resp, "cid-hidden-thinking-tool", "deepseek-v4-pro", "prompt", 0, false, false, []string{"search"}, nil, nil)
 	if rec.Code != http.StatusOK {
 		t.Fatalf("expected 200 for hidden thinking tool calls, got %d body=%s", rec.Code, rec.Body.String())
 	}
@@ -211,7 +212,7 @@ func TestHandleStreamToolsPlainTextStreamsBeforeFinish(t *testing.T) {
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil)

-	h.handleStream(rec, req, resp, "cid6", "deepseek-v4-flash", "prompt", false, false, []string{"search"}, nil)
+	h.handleStream(rec, req, resp, "cid6", "deepseek-v4-flash", "prompt", 0, false, false, []string{"search"}, nil, nil)

 	frames, done := parseSSEDataFrames(t, rec.Body.String())
 	if !done {
@@ -239,6 +240,118 @@ func TestHandleStreamToolsPlainTextStreamsBeforeFinish(t *testing.T) {
 	}
 }

+func TestHandleStreamThinkingDisabledDoesNotLeakHiddenFragmentContinuations(t *testing.T) {
+	h := &Handler{}
+	resp := makeSSEHTTPResponse(
+		`data: {"p":"response/fragments","o":"APPEND","v":[{"type":"THINK","content":"我们"}]}`,
+		`data: {"p":"response/fragments/-1/content","v":"被"}`,
+		`data: {"v":"要求"}`,
+		`data: {"p":"response/fragments","o":"APPEND","v":[{"type":"RESPONSE","content":"答"}]}`,
+		`data: {"p":"response/fragments/-1/content","v":"案"}`,
+		`data: [DONE]`,
+	)
+	rec := httptest.NewRecorder()
+	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil)
+
+	h.handleStream(rec, req, resp, "cid-hidden-fragment", "deepseek-v4-flash", "prompt", 0, false, false, nil, nil, nil)
+
+	frames, done := parseSSEDataFrames(t, rec.Body.String())
+	if !done {
+		t.Fatalf("expected [DONE], body=%s", rec.Body.String())
+	}
+	content := strings.Builder{}
+	for _, frame := range frames {
+		choices, _ := frame["choices"].([]any)
+		for _, item := range choices {
+			choice, _ := item.(map[string]any)
+			delta, _ := choice["delta"].(map[string]any)
+			if c, ok := delta["content"].(string); ok {
+				content.WriteString(c)
+			}
+		}
+	}
+	if got := content.String(); got != "答案" {
+		t.Fatalf("expected only visible response text, got %q body=%s", got, rec.Body.String())
+	}
+}
+
+func TestHandleStreamEmitsSingleChoiceFramesForMultipleParsedParts(t *testing.T) {
+	h := &Handler{}
+	resp := makeSSEHTTPResponse(
+		`data: {"p":"response/fragments","o":"APPEND","v":[{"type":"THINK","content":"我们"},{"type":"THINK","content":"被"},{"type":"THINK","content":"要求"},{"type":"RESPONSE","content":"答"},{"type":"RESPONSE","content":"案"}]}`,
+		`data: [DONE]`,
+	)
+	rec := httptest.NewRecorder()
+	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil)
+
+	h.handleStream(rec, req, resp, "cid-multi-parts", "deepseek-v4-pro", "prompt", 0, true, false, nil, nil, nil)
+
+	frames, done := parseSSEDataFrames(t, rec.Body.String())
+	if !done {
+		t.Fatalf("expected [DONE], body=%s", rec.Body.String())
+	}
+	var reasoning, content strings.Builder
+	for _, frame := range frames {
+		choices, _ := frame["choices"].([]any)
+		if len(choices) != 1 {
+			t.Fatalf("expected exactly one choice per stream frame, got %d frame=%#v body=%s", len(choices), frame, rec.Body.String())
+		}
+		choice, _ := choices[0].(map[string]any)
+		delta, _ := choice["delta"].(map[string]any)
+		reasoning.WriteString(asString(delta["reasoning_content"]))
+		content.WriteString(asString(delta["content"]))
+	}
+	if got := reasoning.String(); got != "我们被要求" {
+		t.Fatalf("first-choice-only client would miss reasoning tokens: got %q body=%s", got, rec.Body.String())
+	}
+	if got := content.String(); got != "答案" {
+		t.Fatalf("first-choice-only client would miss content tokens: got %q body=%s", got, rec.Body.String())
+	}
+}
+
+func TestHandleStreamCoalescesSmallContentDeltas(t *testing.T) {
+	h := &Handler{}
+	lines := make([]string, 0, 101)
+	for i := 0; i < 100; i++ {
+		b, _ := json.Marshal(map[string]any{
+			"p": "response/content",
+			"v": "字",
+		})
+		lines = append(lines, "data: "+string(b))
+	}
+	lines = append(lines, "data: [DONE]")
+	resp := makeSSEHTTPResponse(lines...)
+	rec := httptest.NewRecorder()
+	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil)
+
+	h.handleStream(rec, req, resp, "cid-coalesce", "deepseek-v4-flash", "prompt", 0, false, false, nil, nil, nil)
+
+	frames, done := parseSSEDataFrames(t, rec.Body.String())
+	if !done {
+		t.Fatalf("expected [DONE], body=%s", rec.Body.String())
+	}
+	var content strings.Builder
+	contentDeltaFrames := 0
+	for _, frame := range frames {
+		choices, _ := frame["choices"].([]any)
+		if len(choices) != 1 {
+			t.Fatalf("expected exactly one choice per stream frame, got %d frame=%#v body=%s", len(choices), frame, rec.Body.String())
+		}
+		choice, _ := choices[0].(map[string]any)
+		delta, _ := choice["delta"].(map[string]any)
+		if c, ok := delta["content"].(string); ok {
+			contentDeltaFrames++
+			content.WriteString(c)
+		}
+	}
+	if got, want := content.String(), strings.Repeat("字", 100); got != want {
+		t.Fatalf("coalesced stream content mismatch: got %q want %q body=%s", got, want, rec.Body.String())
+	}
+	if contentDeltaFrames >= 100 {
+		t.Fatalf("expected coalescing to reduce 100 tiny content frames, got %d body=%s", contentDeltaFrames, rec.Body.String())
+	}
+}
+
 func TestHandleStreamIncompleteCapturedToolJSONFlushesAsTextOnFinalize(t *testing.T) {
 	h := &Handler{}
 	resp := makeSSEHTTPResponse(
@@ -248,7 +361,7 @@ func TestHandleStreamIncompleteCapturedToolJSONFlushesAsTextOnFinalize(t *testin
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil)

-	h.handleStream(rec, req, resp, "cid10", "deepseek-v4-flash", "prompt", false, false, []string{"search"}, nil)
+	h.handleStream(rec, req, resp, "cid10", "deepseek-v4-flash", "prompt", 0, false, false, []string{"search"}, nil, nil)

 	frames, done := parseSSEDataFrames(t, rec.Body.String())
 	if !done {
@@ -282,7 +395,7 @@ func TestHandleStreamPromotesThinkingToolCallsOnFinalizeWithoutMidstreamIntercep
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil)

-	h.handleStream(rec, req, resp, "cid-thinking-stream", "deepseek-v4-pro", "prompt", true, false, []string{"search"}, nil)
+	h.handleStream(rec, req, resp, "cid-thinking-stream", "deepseek-v4-pro", "prompt", 0, true, false, []string{"search"}, nil, nil)

 	frames, done := parseSSEDataFrames(t, rec.Body.String())
 	if !done {
@@ -291,20 +404,16 @@ func TestHandleStreamPromotesThinkingToolCallsOnFinalizeWithoutMidstreamIntercep
 	if !streamHasToolCallsDelta(frames) {
 		t.Fatalf("expected tool_calls delta from finalize fallback, body=%s", rec.Body.String())
 	}
-	reasoningSeen := false
 	for _, frame := range frames {
 		choices, _ := frame["choices"].([]any)
 		for _, item := range choices {
 			choice, _ := item.(map[string]any)
 			delta, _ := choice["delta"].(map[string]any)
 			if asString(delta["reasoning_content"]) != "" {
-				reasoningSeen = true
+				t.Fatalf("did not expect leaked reasoning_content markup, body=%s", rec.Body.String())
 			}
 		}
 	}
-	if !reasoningSeen {
-		t.Fatalf("expected reasoning_content to stream before finalize fallback, body=%s", rec.Body.String())
-	}
 	if streamFinishReason(frames) != "tool_calls" {
 		t.Fatalf("expected finish_reason=tool_calls, body=%s", rec.Body.String())
 	}
@@ -319,7 +428,7 @@ func TestHandleStreamPromotesHiddenThinkingDSMLToolCallsOnFinalize(t *testing.T)
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil)

-	h.handleStream(rec, req, resp, "cid-hidden-thinking-stream", "deepseek-v4-pro", "prompt", false, false, []string{"search"}, nil)
+	h.handleStream(rec, req, resp, "cid-hidden-thinking-stream", "deepseek-v4-pro", "prompt", 0, false, false, []string{"search"}, nil, nil)

 	frames, done := parseSSEDataFrames(t, rec.Body.String())
 	if !done {
@@ -353,7 +462,7 @@ func TestHandleStreamEmitsDistinctToolCallIDsAcrossSeparateToolBlocks(t *testing
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil)

-	h.handleStream(rec, req, resp, "cid-multi", "deepseek-v4-flash", "prompt", false, false, []string{"read_file", "search"}, nil)
+	h.handleStream(rec, req, resp, "cid-multi", "deepseek-v4-flash", "prompt", 0, false, false, []string{"read_file", "search"}, nil, nil)

 	frames, done := parseSSEDataFrames(t, rec.Body.String())
 	if !done {
@@ -390,3 +499,106 @@ func TestHandleStreamEmitsDistinctToolCallIDsAcrossSeparateToolBlocks(t *testing
 		t.Fatalf("expected distinct tool call ids across blocks, got %#v body=%s", ids, rec.Body.String())
 	}
 }
+
+func TestHandleStreamCoercesSchemaDeclaredStringArgumentsOnFinalize(t *testing.T) {
+	h := &Handler{}
+	line := func(v string) string {
+		b, _ := json.Marshal(map[string]any{"p": "response/content", "v": v})
+		return "data: " + string(b)
+	}
+	resp := makeSSEHTTPResponse(
+		line(`<tool_calls><invoke name="Write">{"input":{"content":{"message":"hi"},"taskId":1}}</invoke></tool_calls>`),
+		`data: [DONE]`,
+	)
+	rec := httptest.NewRecorder()
+	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil)
+	toolsRaw := []any{
+		map[string]any{
+			"type": "function",
+			"function": map[string]any{
+				"name": "Write",
+				"parameters": map[string]any{
+					"type": "object",
+					"properties": map[string]any{
+						"content": map[string]any{"type": "string"},
+						"taskId":  map[string]any{"type": "string"},
+					},
+				},
+			},
+		},
+	}
+
+	h.handleStream(rec, req, resp, "cid-string-protect", "deepseek-v4-flash", "prompt", 0, false, false, []string{"Write"}, toolsRaw, nil)
+
+	frames, done := parseSSEDataFrames(t, rec.Body.String())
+	if !done {
+		t.Fatalf("expected [DONE], body=%s", rec.Body.String())
+	}
+	for _, frame := range frames {
+		choices, _ := frame["choices"].([]any)
+		for _, item := range choices {
+			choice, _ := item.(map[string]any)
+			delta, _ := choice["delta"].(map[string]any)
+			toolCalls, _ := delta["tool_calls"].([]any)
+			if len(toolCalls) == 0 {
+				continue
+			}
+			call, _ := toolCalls[0].(map[string]any)
+			fn, _ := call["function"].(map[string]any)
+			args := map[string]any{}
+			if err := json.Unmarshal([]byte(asString(fn["arguments"])), &args); err != nil {
+				t.Fatalf("decode streamed tool arguments failed: %v", err)
+			}
+			if args["content"] != `{"message":"hi"}` {
+				t.Fatalf("expected streamed content stringified by schema, got %#v", args["content"])
+			}
+			if args["taskId"] != "1" {
+				t.Fatalf("expected streamed taskId stringified by schema, got %#v", args["taskId"])
+			}
+			return
+		}
+	}
+	t.Fatalf("expected at least one streamed tool call delta, body=%s", rec.Body.String())
+}
+
+func TestHandleNonStreamWithRetryIncludesRefFileTokensInUsage(t *testing.T) {
+	h := &Handler{}
+
+	run := func(refFileTokens int) map[string]any {
+		resp := makeSSEHTTPResponse(
+			`data: {"p":"response/content","v":"hello world"}`,
+			`data: [DONE]`,
+		)
+		rec := httptest.NewRecorder()
+		h.handleNonStreamWithRetry(rec, context.Background(), nil, resp, nil, "", "cid-ref", "deepseek-v4-flash", "prompt", refFileTokens, false, false, nil, nil, nil)
+		if rec.Code != http.StatusOK {
+			t.Fatalf("expected 200, got %d body=%s", rec.Code, rec.Body.String())
+		}
+		return decodeJSONBody(t, rec.Body.String())
+	}
+
+	base := run(0)
+	withRef := run(7)
+
+	baseUsage, _ := base["usage"].(map[string]any)
+	refUsage, _ := withRef["usage"].(map[string]any)
+	if baseUsage == nil || refUsage == nil {
+		t.Fatalf("expected usage objects, base=%#v ref=%#v", base["usage"], withRef["usage"])
+	}
+
+	getInt := func(m map[string]any, key string) int {
+		t.Helper()
+		v, ok := m[key].(float64)
+		if !ok {
+			t.Fatalf("expected numeric %s, got %#v", key, m[key])
+		}
+		return int(v)
+	}
+
+	if got := getInt(refUsage, "prompt_tokens") - getInt(baseUsage, "prompt_tokens"); got != 7 {
+		t.Fatalf("expected prompt_tokens delta 7, got %d", got)
+	}
+	if got := getInt(refUsage, "total_tokens") - getInt(baseUsage, "total_tokens"); got != 7 {
+		t.Fatalf("expected total_tokens delta 7, got %d", got)
+	}
+}
--- a/internal/httpapi/openai/chat/ref_file_tokens.go
+++ b/internal/httpapi/openai/chat/ref_file_tokens.go
@@ -0,0 +1,26 @@
+package chat
+
+// addRefFileTokensToUsage adds inline-uploaded file token estimates to an existing
+// usage map inside a response object. This keeps the token accounting aware of file
+// content that the upstream model processes but that is not part of the prompt text.
+func addRefFileTokensToUsage(obj map[string]any, refFileTokens int) {
+	if refFileTokens <= 0 || obj == nil {
+		return
+	}
+	usage, ok := obj["usage"].(map[string]any)
+	if !ok || usage == nil {
+		return
+	}
+	for _, key := range []string{"input_tokens", "prompt_tokens"} {
+		if v, ok := usage[key]; ok {
+			if n, ok := v.(int); ok {
+				usage[key] = n + refFileTokens
+			}
+		}
+	}
+	if v, ok := usage["total_tokens"]; ok {
+		if n, ok := v.(int); ok {
+			usage["total_tokens"] = n + refFileTokens
+		}
+	}
+}
--- a/internal/httpapi/openai/chat/vercel_prepare_test.go
+++ b/internal/httpapi/openai/chat/vercel_prepare_test.go
@@ -10,7 +10,6 @@ import (

 	"ds2api/internal/auth"
 	dsclient "ds2api/internal/deepseek/client"
-	"ds2api/internal/promptcompat"
 )

 func TestIsVercelStreamPrepareRequest(t *testing.T) {
@@ -131,8 +130,8 @@ func TestHandleVercelStreamPrepareAppliesCurrentInputFile(t *testing.T) {
 		t.Fatalf("expected payload object, got %#v", body["payload"])
 	}
 	promptText, _ := payload["prompt"].(string)
-	if !strings.Contains(promptText, promptcompat.BuildOpenAICurrentInputContextPrompt()) {
-		t.Fatalf("expected compacted-context prompt, got %s", promptText)
+	if !strings.Contains(promptText, "Continue from the latest state in the attached DS2API_HISTORY.txt context.") {
+		t.Fatalf("expected continuation prompt, got %s", promptText)
 	}
 	if strings.Contains(promptText, "first user turn") || strings.Contains(promptText, "latest user turn") {
 		t.Fatalf("expected original turns hidden from prompt, got %s", promptText)
--- a/internal/httpapi/openai/citation_links_test.go
+++ b/internal/httpapi/openai/citation_links_test.go
@@ -26,3 +26,59 @@ func TestReplaceCitationMarkersWithLinksKeepsUnknownIndex(t *testing.T) {
 		t.Fatalf("expected %q, got %q", want, got)
 	}
 }
+
+func TestReplaceCitationMarkersWithLinksSupportsReferenceMarker(t *testing.T) {
+	raw := "新闻摘要[reference:1]，详情[reference:2]。"
+	links := map[int]string{
+		1: "https://example.com/r1",
+		2: "https://example.com/r2",
+	}
+
+	got := replaceCitationMarkersWithLinks(raw, links)
+	want := "新闻摘要[1](https://example.com/r1)，详情[2](https://example.com/r2)。"
+	if got != want {
+		t.Fatalf("expected %q, got %q", want, got)
+	}
+}
+
+func TestReplaceCitationMarkersWithLinksSupportsReferenceZeroBased(t *testing.T) {
+	raw := "来源[reference:0] 与 [reference:1]。"
+	links := map[int]string{
+		1: "https://example.com/first",
+		2: "https://example.com/second",
+	}
+
+	got := replaceCitationMarkersWithLinks(raw, links)
+	want := "来源[0](https://example.com/first) 与 [1](https://example.com/second)。"
+	if got != want {
+		t.Fatalf("expected %q, got %q", want, got)
+	}
+}
+
+func TestReplaceCitationMarkersWithLinksKeepsCitationOneBasedWithZeroBasedReference(t *testing.T) {
+	raw := "引用[citation:1]，来源[reference:0]，后续[reference:1]。"
+	links := map[int]string{
+		1: "https://example.com/first",
+		2: "https://example.com/second",
+	}
+
+	got := replaceCitationMarkersWithLinks(raw, links)
+	want := "引用[1](https://example.com/first)，来源[0](https://example.com/first)，后续[1](https://example.com/second)。"
+	if got != want {
+		t.Fatalf("expected %q, got %q", want, got)
+	}
+}
+
+func TestReplaceCitationMarkersWithLinksDetectsSpacedReferenceZeroBased(t *testing.T) {
+	raw := "来源[reference: 0] 与 [reference: 1]。"
+	links := map[int]string{
+		1: "https://example.com/first",
+		2: "https://example.com/second",
+	}
+
+	got := replaceCitationMarkersWithLinks(raw, links)
+	want := "来源[0](https://example.com/first) 与 [1](https://example.com/second)。"
+	if got != want {
+		t.Fatalf("expected %q, got %q", want, got)
+	}
+}
--- a/internal/httpapi/openai/file_inline_upload_test.go
+++ b/internal/httpapi/openai/file_inline_upload_test.go
@@ -94,6 +94,9 @@ func TestPreprocessInlineFileInputsReplacesDataURLAndCollectsRefFileIDs(t *testi
 	if len(ds.uploadCalls) != 1 {
 		t.Fatalf("expected 1 upload, got %d", len(ds.uploadCalls))
 	}
+	if ds.uploadCalls[0].ModelType != "default" {
+		t.Fatalf("expected default model type when request omits model, got %q", ds.uploadCalls[0].ModelType)
+	}
 	if ds.lastCtx != ctx {
 		t.Fatalf("expected upload to use request context")
 	}
@@ -149,7 +152,7 @@ func TestPreprocessInlineFileInputsDeduplicatesIdenticalPayloads(t *testing.T) {
 func TestChatCompletionsUploadsInlineFilesBeforeCompletion(t *testing.T) {
 	ds := &inlineUploadDSStub{}
 	h := &openAITestSurface{Store: mockOpenAIConfig{wideInput: true}, Auth: streamStatusAuthStub{}, DS: ds}
-	reqBody := `{"model":"deepseek-v4-flash","messages":[{"role":"user","content":[{"type":"input_text","text":"hi"},{"type":"image_url","image_url":{"url":"data:image/png;base64,QUJDRA=="}}]}],"stream":false}`
+	reqBody := `{"model":"deepseek-v4-vision","messages":[{"role":"user","content":[{"type":"input_text","text":"hi"},{"type":"image_url","image_url":{"url":"data:image/png;base64,QUJDRA=="}}]}],"stream":false}`
 	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", strings.NewReader(reqBody))
 	req.Header.Set("Authorization", "Bearer direct-token")
 	req.Header.Set("Content-Type", "application/json")
@@ -163,6 +166,9 @@ func TestChatCompletionsUploadsInlineFilesBeforeCompletion(t *testing.T) {
 	if len(ds.uploadCalls) != 1 {
 		t.Fatalf("expected 1 upload call, got %d", len(ds.uploadCalls))
 	}
+	if ds.uploadCalls[0].ModelType != "vision" {
+		t.Fatalf("expected vision model type for vision request, got %q", ds.uploadCalls[0].ModelType)
+	}
 	if ds.completionReq == nil {
 		t.Fatal("expected completion payload to be captured")
 	}
@@ -177,7 +183,7 @@ func TestResponsesUploadsInlineFilesBeforeCompletion(t *testing.T) {
 	h := &openAITestSurface{Store: mockOpenAIConfig{wideInput: true}, Auth: streamStatusAuthStub{}, DS: ds}
 	r := chi.NewRouter()
 	registerOpenAITestRoutes(r, h)
-	reqBody := `{"model":"deepseek-v4-flash","input":[{"role":"user","content":[{"type":"input_text","text":"hi"},{"type":"input_image","image_url":{"url":"data:image/png;base64,QUJDRA=="}}]}],"stream":false}`
+	reqBody := `{"model":"deepseek-v4-pro","input":[{"role":"user","content":[{"type":"input_text","text":"hi"},{"type":"input_image","image_url":{"url":"data:image/png;base64,QUJDRA=="}}]}],"stream":false}`
 	req := httptest.NewRequest(http.MethodPost, "/v1/responses", strings.NewReader(reqBody))
 	req.Header.Set("Authorization", "Bearer direct-token")
 	req.Header.Set("Content-Type", "application/json")
@@ -191,6 +197,9 @@ func TestResponsesUploadsInlineFilesBeforeCompletion(t *testing.T) {
 	if len(ds.uploadCalls) != 1 {
 		t.Fatalf("expected 1 upload call, got %d", len(ds.uploadCalls))
 	}
+	if ds.uploadCalls[0].ModelType != "expert" {
+		t.Fatalf("expected expert model type for pro request, got %q", ds.uploadCalls[0].ModelType)
+	}
 	refIDs, _ := ds.completionReq["ref_file_ids"].([]any)
 	if len(refIDs) != 1 || refIDs[0] != "file-inline-1" {
 		t.Fatalf("unexpected completion ref_file_ids: %#v", ds.completionReq["ref_file_ids"])
@@ -216,6 +225,45 @@ func TestChatCompletionsInlineUploadFailureReturnsBadRequest(t *testing.T) {
 	}
 }

+func TestChatCompletionsInlineUploadLimitReturnsBadRequest(t *testing.T) {
+	ds := &inlineUploadDSStub{}
+	h := &openAITestSurface{Store: mockOpenAIConfig{wideInput: true}, Auth: streamStatusAuthStub{}, DS: ds}
+	content := []any{map[string]any{"type": "input_text", "text": "hi"}}
+	for i := 0; i < 51; i++ {
+		content = append(content, map[string]any{
+			"type":      "image_url",
+			"image_url": map[string]any{"url": "data:image/png;base64,QUJDRA=="},
+		})
+	}
+	body, err := json.Marshal(map[string]any{
+		"model": "deepseek-v4-flash",
+		"messages": []any{map[string]any{
+			"role":    "user",
+			"content": content,
+		}},
+		"stream": false,
+	})
+	if err != nil {
+		t.Fatalf("marshal request: %v", err)
+	}
+	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", strings.NewReader(string(body)))
+	req.Header.Set("Authorization", "Bearer direct-token")
+	req.Header.Set("Content-Type", "application/json")
+	rec := httptest.NewRecorder()
+
+	h.ChatCompletions(rec, req)
+
+	if rec.Code != http.StatusBadRequest {
+		t.Fatalf("expected 400, got %d body=%s", rec.Code, rec.Body.String())
+	}
+	if !strings.Contains(rec.Body.String(), "exceeded maximum of 50 inline files per request") {
+		t.Fatalf("expected inline file limit error, got body=%s", rec.Body.String())
+	}
+	if ds.completionReq != nil {
+		t.Fatalf("did not expect completion call after inline file limit error")
+	}
+}
+
 func TestResponsesInlineUploadFailureReturnsInternalServerError(t *testing.T) {
 	ds := &inlineUploadDSStub{uploadErr: errors.New("boom")}
 	h := &openAITestSurface{Store: mockOpenAIConfig{wideInput: true}, Auth: streamStatusAuthStub{}, DS: ds}
--- a/internal/httpapi/openai/files/file_inline_upload.go
+++ b/internal/httpapi/openai/files/file_inline_upload.go
@@ -12,6 +12,7 @@ import (
 	"strings"

 	"ds2api/internal/auth"
+	"ds2api/internal/config"
 	dsclient "ds2api/internal/deepseek/client"
 	"ds2api/internal/httpapi/openai/shared"
 	"ds2api/internal/promptcompat"
@@ -39,11 +40,13 @@ func (e *inlineFileUploadError) Error() string {
 }

 type inlineUploadState struct {
-	ctx          context.Context
-	handler      *Handler
-	auth         *auth.RequestAuth
-	uploadedByID map[string]string
-	uploadCount  int
+	ctx             context.Context
+	handler         *Handler
+	auth            *auth.RequestAuth
+	modelType       string
+	uploadedByID    map[string]string
+	uploadCount     int
+	inlineFileBytes int
 }

 type inlineDecodedFile struct {
@@ -57,10 +60,19 @@ func (h *Handler) PreprocessInlineFileInputs(ctx context.Context, a *auth.Reques
 	if h == nil || h.DS == nil || len(req) == 0 {
 		return nil
 	}
+	modelType := "default"
+	if requestedModel, ok := req["model"].(string); ok {
+		if resolvedModel, ok := config.ResolveModel(h.Store, requestedModel); ok {
+			if resolvedType, ok := config.GetModelType(resolvedModel); ok {
+				modelType = resolvedType
+			}
+		}
+	}
 	state := &inlineUploadState{
 		ctx:          ctx,
 		handler:      h,
 		auth:         a,
+		modelType:    modelType,
 		uploadedByID: map[string]string{},
 	}
 	for _, key := range []string{"messages", "input", "attachments"} {
@@ -75,6 +87,9 @@ func (h *Handler) PreprocessInlineFileInputs(ctx context.Context, a *auth.Reques
 	if refIDs := promptcompat.CollectOpenAIRefFileIDs(req); len(refIDs) > 0 {
 		req["ref_file_ids"] = stringsToAnySlice(refIDs)
 	}
+	if state.inlineFileBytes > 0 {
+		req["_inline_file_bytes"] = state.inlineFileBytes
+	}
 	return nil
 }

@@ -135,13 +150,15 @@ func (s *inlineUploadState) tryUploadBlock(block map[string]any) (map[string]any
 		return nil, false, nil
 	}
 	if s.uploadCount >= maxInlineFilesPerRequest {
-		return nil, true, fmt.Errorf("exceeded maximum of %d inline files per request", maxInlineFilesPerRequest)
+		err := fmt.Errorf("exceeded maximum of %d inline files per request", maxInlineFilesPerRequest)
+		return nil, true, &inlineFileUploadError{status: http.StatusBadRequest, message: err.Error(), err: err}
 	}
 	fileID, err := s.uploadInlineFile(decoded)
 	if err != nil {
 		return nil, true, &inlineFileUploadError{status: http.StatusInternalServerError, message: "Failed to upload inline file.", err: err}
 	}
 	s.uploadCount++
+	s.inlineFileBytes += len(decoded.Data)
 	replacement := map[string]any{
 		"type":    decoded.ReplacementType,
 		"file_id": fileID,
@@ -168,6 +185,7 @@ func (s *inlineUploadState) uploadInlineFile(file inlineDecodedFile) (string, er
 	result, err := s.handler.DS.UploadFile(s.ctx, s.auth, dsclient.UploadFileRequest{
 		Filename:    file.Filename,
 		ContentType: contentType,
+		ModelType:   s.modelType,
 		Data:        file.Data,
 	}, 3)
 	if err != nil {
--- a/internal/httpapi/openai/files/handler_files.go
+++ b/internal/httpapi/openai/files/handler_files.go
@@ -8,6 +8,7 @@ import (

 	"ds2api/internal/auth"
 	"ds2api/internal/chathistory"
+	"ds2api/internal/config"
 	dsclient "ds2api/internal/deepseek/client"
 	"ds2api/internal/httpapi/openai/shared"
 )
@@ -66,10 +67,12 @@ func (h *Handler) UploadFile(w http.ResponseWriter, r *http.Request) {
 	if contentType == "" && len(data) > 0 {
 		contentType = http.DetectContentType(data)
 	}
+	modelType := resolveUploadModelType(h.Store, r)
 	result, err := h.DS.UploadFile(r.Context(), a, dsclient.UploadFileRequest{
 		Filename:    header.Filename,
 		ContentType: contentType,
 		Purpose:     strings.TrimSpace(r.FormValue("purpose")),
+		ModelType:   modelType,
 		Data:        data,
 	}, 3)
 	if err != nil {
@@ -82,6 +85,32 @@ func (h *Handler) UploadFile(w http.ResponseWriter, r *http.Request) {
 	shared.WriteJSON(w, http.StatusOK, buildOpenAIFileObject(result))
 }

+func resolveUploadModelType(store shared.ConfigReader, r *http.Request) string {
+	for _, candidate := range []string{r.FormValue("model_type"), r.Header.Get("X-Model-Type")} {
+		if modelType := normalizeUploadModelType(candidate); modelType != "" {
+			return modelType
+		}
+	}
+	requestedModel := strings.TrimSpace(r.FormValue("model"))
+	if requestedModel != "" {
+		if resolvedModel, ok := config.ResolveModel(store, requestedModel); ok {
+			if modelType, ok := config.GetModelType(resolvedModel); ok {
+				return modelType
+			}
+		}
+	}
+	return "default"
+}
+
+func normalizeUploadModelType(raw string) string {
+	switch normalized := strings.ToLower(strings.TrimSpace(raw)); normalized {
+	case "default", "expert", "vision":
+		return normalized
+	default:
+		return ""
+	}
+}
+
 func buildOpenAIFileObject(result *dsclient.UploadFileResult) map[string]any {
 	if result == nil {
 		obj := map[string]any{
--- a/internal/httpapi/openai/files_route_test.go
+++ b/internal/httpapi/openai/files_route_test.go
@@ -77,7 +77,7 @@ func (m *filesRouteDSStub) DeleteAllSessionsForToken(_ context.Context, _ string
 	return nil
 }

-func newMultipartUploadRequest(t *testing.T, purpose string, filename string, data []byte) *http.Request {
+func newMultipartUploadRequest(t *testing.T, purpose string, filename string, data []byte, model string) *http.Request {
 	t.Helper()
 	var body bytes.Buffer
 	writer := multipart.NewWriter(&body)
@@ -86,6 +86,11 @@ func newMultipartUploadRequest(t *testing.T, purpose string, filename string, da
 			t.Fatalf("write purpose failed: %v", err)
 		}
 	}
+	if model != "" {
+		if err := writer.WriteField("model", model); err != nil {
+			t.Fatalf("write model failed: %v", err)
+		}
+	}
 	part, err := writer.CreateFormFile("file", filename)
 	if err != nil {
 		t.Fatalf("create form file failed: %v", err)
@@ -108,7 +113,7 @@ func TestFilesRouteUploadSuccess(t *testing.T) {
 	r := chi.NewRouter()
 	registerOpenAITestRoutes(r, h)

-	req := newMultipartUploadRequest(t, "assistants", "notes.txt", []byte("hello world"))
+	req := newMultipartUploadRequest(t, "assistants", "notes.txt", []byte("hello world"), "deepseek-v4-vision")
 	rec := httptest.NewRecorder()
 	r.ServeHTTP(rec, req)

@@ -121,6 +126,9 @@ func TestFilesRouteUploadSuccess(t *testing.T) {
 	if ds.lastReq.Purpose != "assistants" {
 		t.Fatalf("expected purpose assistants, got %q", ds.lastReq.Purpose)
 	}
+	if ds.lastReq.ModelType != "vision" {
+		t.Fatalf("expected vision model type, got %q", ds.lastReq.ModelType)
+	}
 	if string(ds.lastReq.Data) != "hello world" {
 		t.Fatalf("unexpected uploaded data: %q", string(ds.lastReq.Data))
 	}
@@ -145,7 +153,7 @@ func TestFilesRouteUploadIncludesAccountIDForManagedAccount(t *testing.T) {
 	r := chi.NewRouter()
 	registerOpenAITestRoutes(r, h)

-	req := newMultipartUploadRequest(t, "assistants", "notes.txt", []byte("hello world"))
+	req := newMultipartUploadRequest(t, "assistants", "notes.txt", []byte("hello world"), "deepseek-v4-vision")
 	rec := httptest.NewRecorder()
 	r.ServeHTTP(rec, req)

--- a/internal/httpapi/openai/history/current_input_file.go
+++ b/internal/httpapi/openai/history/current_input_file.go
@@ -7,13 +7,14 @@ import (
 	"strings"

 	"ds2api/internal/auth"
+	"ds2api/internal/config"
 	dsclient "ds2api/internal/deepseek/client"
 	"ds2api/internal/httpapi/openai/shared"
 	"ds2api/internal/promptcompat"
 )

 const (
-	currentInputFilename    = "IGNORE.txt"
+	currentInputFilename    = promptcompat.CurrentInputContextFilename
 	currentInputContentType = "text/plain; charset=utf-8"
 	currentInputPurpose     = "assistants"
 )
@@ -35,11 +36,15 @@ func (s Service) ApplyCurrentInputFile(ctx context.Context, a *auth.RequestAuth,
 	if strings.TrimSpace(fileText) == "" {
 		return stdReq, errors.New("current user input file produced empty transcript")
 	}
-
+	modelType := "default"
+	if resolvedType, ok := config.GetModelType(stdReq.ResolvedModel); ok {
+		modelType = resolvedType
+	}
 	result, err := s.DS.UploadFile(ctx, a, dsclient.UploadFileRequest{
 		Filename:    currentInputFilename,
 		ContentType: currentInputContentType,
 		Purpose:     currentInputPurpose,
+		ModelType:   modelType,
 		Data:        []byte(fileText),
 	}, 3)
 	if err != nil {
@@ -58,9 +63,13 @@ func (s Service) ApplyCurrentInputFile(ctx context.Context, a *auth.RequestAuth,
 	}

 	stdReq.Messages = messages
+	stdReq.HistoryText = fileText
 	stdReq.CurrentInputFileApplied = true
 	stdReq.RefFileIDs = prependUniqueRefFileID(stdReq.RefFileIDs, fileID)
 	stdReq.FinalPrompt, stdReq.ToolNames = promptcompat.BuildOpenAIPrompt(messages, stdReq.ToolsRaw, "", stdReq.ToolChoice, stdReq.Thinking)
+	// Token accounting must reflect the actual downstream context:
+	// the uploaded DS2API_HISTORY.txt file content + the continuation live prompt.
+	stdReq.PromptTokenText = fileText + "\n" + stdReq.FinalPrompt
 	return stdReq, nil
 }

@@ -84,5 +93,5 @@ func latestUserInputForFile(messages []any) (int, string) {
 }

 func currentInputFilePrompt() string {
-	return promptcompat.BuildOpenAICurrentInputContextPrompt()
+	return "Continue from the latest state in the attached DS2API_HISTORY.txt context. Treat it as the current working state and answer the latest user request directly."
 }
--- a/internal/httpapi/openai/history_split_test.go
+++ b/internal/httpapi/openai/history_split_test.go
@@ -14,6 +14,7 @@ import (
 	"ds2api/internal/auth"
 	dsclient "ds2api/internal/deepseek/client"
 	"ds2api/internal/promptcompat"
+	"ds2api/internal/util"
 )

 func historySplitTestMessages() []any {
@@ -60,30 +61,32 @@ func (streamStatusManagedAuthStub) DetermineCaller(_ *http.Request) (*auth.Reque

 func (streamStatusManagedAuthStub) Release(_ *auth.RequestAuth) {}

-func TestBuildOpenAICurrentInputContextTranscriptUsesInjectedFileWrapper(t *testing.T) {
+func TestBuildOpenAICurrentInputContextTranscriptUsesNumberedHistorySections(t *testing.T) {
 	_, historyMessages := splitOpenAIHistoryMessages(historySplitTestMessages(), 1)
 	transcript := buildOpenAICurrentInputContextTranscript(historyMessages)

-	if !strings.HasPrefix(transcript, "[file content end]\n\n") {
-		t.Fatalf("expected injected file wrapper prefix, got %q", transcript)
+	if strings.Contains(transcript, "[file content end]") || strings.Contains(transcript, "[file content begin]") || strings.Contains(transcript, "[file name]:") {
+		t.Fatalf("expected transcript without file wrapper tags, got %q", transcript)
 	}
-	if !strings.Contains(transcript, "[context note]") || !strings.Contains(transcript, "compacted snapshot of the prior conversation history") {
-		t.Fatalf("expected compacted context note in transcript, got %q", transcript)
+	if !strings.Contains(transcript, "# DS2API_HISTORY.txt") {
+		t.Fatalf("expected history transcript header, got %q", transcript)
 	}
-	if !strings.Contains(transcript, "<｜begin▁of▁sentence｜>") {
-		t.Fatalf("expected serialized conversation markers, got %q", transcript)
+	if !strings.Contains(transcript, "Prior conversation history and tool progress.") {
+		t.Fatalf("expected history transcript description, got %q", transcript)
 	}
-	if !strings.Contains(transcript, "first user turn") || !strings.Contains(transcript, "tool result") {
-		t.Fatalf("expected historical turns preserved, got %q", transcript)
-	}
-	if !strings.Contains(transcript, "[reasoning_content]") || !strings.Contains(transcript, "hidden reasoning") {
-		t.Fatalf("expected reasoning block preserved, got %q", transcript)
-	}
-	if !strings.Contains(transcript, "<|DSML|tool_calls>") {
-		t.Fatalf("expected tool calls preserved, got %q", transcript)
-	}
-	if !strings.HasSuffix(transcript, "\n[file name]: IGNORE\n[file content begin]\n") {
-		t.Fatalf("expected injected file wrapper suffix, got %q", transcript)
+	for _, want := range []string{
+		"=== 1. USER ===",
+		"=== 2. ASSISTANT ===",
+		"=== 3. TOOL ===",
+		"first user turn",
+		"tool result",
+		"[reasoning_content]",
+		"hidden reasoning",
+		"<|DSML|tool_calls>",
+	} {
+		if !strings.Contains(transcript, want) {
+			t.Fatalf("expected transcript to contain %q, got %q", want, transcript)
+		}
 	}
 }

@@ -224,7 +227,7 @@ func TestApplyCurrentInputFileDisabledPassThrough(t *testing.T) {
 		DS: ds,
 	}
 	req := map[string]any{
-		"model":    "deepseek-v4-flash",
+		"model":    "deepseek-v4-vision",
 		"messages": historySplitTestMessages(),
 	}
 	stdReq, err := promptcompat.NormalizeOpenAIChatRequest(h.Store, req, "")
@@ -247,7 +250,7 @@ func TestApplyCurrentInputFileDisabledPassThrough(t *testing.T) {
 	}
 }

-func TestApplyCurrentInputFileUploadsFirstTurnWithInjectedWrapper(t *testing.T) {
+func TestApplyCurrentInputFileUploadsFirstTurnWithNumberedHistoryTranscript(t *testing.T) {
 	ds := &inlineUploadDSStub{}
 	h := &openAITestSurface{
 		Store: mockOpenAIConfig{
@@ -277,34 +280,90 @@ func TestApplyCurrentInputFileUploadsFirstTurnWithInjectedWrapper(t *testing.T)
 		t.Fatalf("expected 1 current input upload, got %d", len(ds.uploadCalls))
 	}
 	upload := ds.uploadCalls[0]
-	if upload.Filename != "IGNORE.txt" {
+	if upload.Filename != "DS2API_HISTORY.txt" {
 		t.Fatalf("unexpected upload filename: %q", upload.Filename)
 	}
 	uploadedText := string(upload.Data)
-	if !strings.HasPrefix(uploadedText, "[file content end]\n\n") {
-		t.Fatalf("expected injected file wrapper prefix, got %q", uploadedText)
+	if strings.Contains(uploadedText, "[file content end]") || strings.Contains(uploadedText, "[file content begin]") || strings.Contains(uploadedText, "[file name]:") {
+		t.Fatalf("expected uploaded transcript without file wrapper tags, got %q", uploadedText)
 	}
-	if !strings.Contains(uploadedText, "<｜begin▁of▁sentence｜><｜User｜>first turn content that is long enough") {
-		t.Fatalf("expected serialized current user turn markers, got %q", uploadedText)
+	for _, want := range []string{
+		"# DS2API_HISTORY.txt",
+		"=== 1. USER ===",
+		"first turn content that is long enough",
+	} {
+		if !strings.Contains(uploadedText, want) {
+			t.Fatalf("expected uploaded transcript to contain %q, got %q", want, uploadedText)
+		}
 	}
 	if !strings.Contains(uploadedText, promptcompat.ThinkingInjectionMarker) {
 		t.Fatalf("expected thinking injection in current input file, got %q", uploadedText)
 	}
-	if !strings.HasSuffix(uploadedText, "\n[file name]: IGNORE\n[file content begin]\n") {
-		t.Fatalf("expected injected file wrapper suffix, got %q", uploadedText)
-	}
+
 	if strings.Contains(out.FinalPrompt, "first turn content that is long enough") {
 		t.Fatalf("expected current input text to be replaced in live prompt, got %s", out.FinalPrompt)
 	}
-	if strings.Contains(out.FinalPrompt, "CURRENT_USER_INPUT.txt") || strings.Contains(out.FinalPrompt, "IGNORE.txt") || strings.Contains(out.FinalPrompt, "Read that file") {
+	if strings.Contains(out.FinalPrompt, "CURRENT_USER_INPUT.txt") || strings.Contains(out.FinalPrompt, "Read that file") {
 		t.Fatalf("expected live prompt not to instruct file reads, got %s", out.FinalPrompt)
 	}
-	if !strings.Contains(out.FinalPrompt, promptcompat.BuildOpenAICurrentInputContextPrompt()) {
-		t.Fatalf("expected compacted-context instruction in live prompt, got %s", out.FinalPrompt)
+	if !strings.Contains(out.FinalPrompt, "Continue from the latest state in the attached DS2API_HISTORY.txt context.") {
+		t.Fatalf("expected continuation-oriented prompt in live prompt, got %s", out.FinalPrompt)
 	}
 	if len(out.RefFileIDs) != 1 || out.RefFileIDs[0] != "file-inline-1" {
 		t.Fatalf("expected current input file id in ref_file_ids, got %#v", out.RefFileIDs)
 	}
+	if !strings.Contains(out.PromptTokenText, "first turn content that is long enough") {
+		t.Fatalf("expected prompt token text to preserve original full context, got %q", out.PromptTokenText)
+	}
+	if !strings.Contains(out.PromptTokenText, "# DS2API_HISTORY.txt") || !strings.Contains(out.PromptTokenText, "=== 1. USER ===") {
+		t.Fatalf("expected prompt token text to include numbered history transcript, got %q", out.PromptTokenText)
+	}
+}
+
+func TestApplyCurrentInputFilePreservesFullContextPromptForTokenCounting(t *testing.T) {
+	ds := &inlineUploadDSStub{}
+	h := &openAITestSurface{
+		Store: mockOpenAIConfig{
+			wideInput:           true,
+			currentInputEnabled: true,
+			currentInputMin:     0,
+			thinkingInjection:   boolPtr(true),
+		},
+		DS: ds,
+	}
+	req := map[string]any{
+		"model":    "deepseek-v4-vision",
+		"messages": historySplitTestMessages(),
+	}
+	stdReq, err := promptcompat.NormalizeOpenAIChatRequest(h.Store, req, "")
+	if err != nil {
+		t.Fatalf("normalize failed: %v", err)
+	}
+
+	out, err := h.applyCurrentInputFile(context.Background(), &auth.RequestAuth{DeepSeekToken: "token"}, stdReq)
+	if err != nil {
+		t.Fatalf("apply current input file failed: %v", err)
+	}
+	if out.FinalPrompt == stdReq.FinalPrompt {
+		t.Fatalf("expected live prompt to be rewritten after current input file")
+	}
+	// PromptTokenText must include the uploaded file content (which contains the full context)
+	// plus the neutral live prompt, so that it reflects the actual downstream token cost.
+	if !strings.Contains(out.PromptTokenText, "first user turn") || !strings.Contains(out.PromptTokenText, "latest user turn") {
+		t.Fatalf("expected prompt token text to contain file context with full conversation, got %q", out.PromptTokenText)
+	}
+	if strings.Contains(out.PromptTokenText, "[file content end]") || strings.Contains(out.PromptTokenText, "[file name]:") {
+		t.Fatalf("expected prompt token text to omit file wrapper tags, got %q", out.PromptTokenText)
+	}
+	if !strings.Contains(out.PromptTokenText, "# DS2API_HISTORY.txt") || !strings.Contains(out.PromptTokenText, "=== 1. SYSTEM ===") {
+		t.Fatalf("expected prompt token text to include numbered history transcript, got %q", out.PromptTokenText)
+	}
+	if !strings.Contains(out.PromptTokenText, "Continue from the latest state in the attached DS2API_HISTORY.txt context.") {
+		t.Fatalf("expected prompt token text to also include continuation prompt, got %q", out.PromptTokenText)
+	}
+	if strings.Contains(out.FinalPrompt, "first user turn") || strings.Contains(out.FinalPrompt, "latest user turn") {
+		t.Fatalf("expected live prompt to hide original turns, got %q", out.FinalPrompt)
+	}
 }

 func TestApplyCurrentInputFileUploadsFullContextFile(t *testing.T) {
@@ -319,7 +378,7 @@ func TestApplyCurrentInputFileUploadsFullContextFile(t *testing.T) {
 		DS: ds,
 	}
 	req := map[string]any{
-		"model":    "deepseek-v4-flash",
+		"model":    "deepseek-v4-vision",
 		"messages": historySplitTestMessages(),
 	}
 	stdReq, err := promptcompat.NormalizeOpenAIChatRequest(h.Store, req, "")
@@ -338,24 +397,27 @@ func TestApplyCurrentInputFileUploadsFullContextFile(t *testing.T) {
 		t.Fatalf("expected one current input upload, got %d", len(ds.uploadCalls))
 	}
 	upload := ds.uploadCalls[0]
-	if upload.Filename != "IGNORE.txt" {
-		t.Fatalf("expected IGNORE.txt upload, got %q", upload.Filename)
+	if upload.Filename != "DS2API_HISTORY.txt" {
+		t.Fatalf("expected DS2API_HISTORY.txt upload, got %q", upload.Filename)
+	}
+	if upload.ModelType != "vision" {
+		t.Fatalf("expected vision model type for vision request, got %q", upload.ModelType)
 	}
 	uploadedText := string(upload.Data)
-	for _, want := range []string{"system instructions", "first user turn", "hidden reasoning", "tool result", "latest user turn", promptcompat.ThinkingInjectionMarker} {
+	for _, want := range []string{"# DS2API_HISTORY.txt", "=== 1. SYSTEM ===", "=== 2. USER ===", "=== 3. ASSISTANT ===", "=== 4. TOOL ===", "=== 5. USER ===", "system instructions", "first user turn", "hidden reasoning", "tool result", "latest user turn", promptcompat.ThinkingInjectionMarker} {
 		if !strings.Contains(uploadedText, want) {
 			t.Fatalf("expected full context file to contain %q, got %q", want, uploadedText)
 		}
 	}
-	if strings.Contains(out.FinalPrompt, "first user turn") || strings.Contains(out.FinalPrompt, "latest user turn") || strings.Contains(out.FinalPrompt, "CURRENT_USER_INPUT.txt") || strings.Contains(out.FinalPrompt, "IGNORE.txt") || strings.Contains(out.FinalPrompt, "Read that file") {
-		t.Fatalf("expected live prompt to stay in compacted-context mode, got %s", out.FinalPrompt)
+	if strings.Contains(out.FinalPrompt, "first user turn") || strings.Contains(out.FinalPrompt, "latest user turn") || strings.Contains(out.FinalPrompt, "CURRENT_USER_INPUT.txt") || strings.Contains(out.FinalPrompt, "Read that file") {
+		t.Fatalf("expected live prompt to use only a continuation instruction, got %s", out.FinalPrompt)
 	}
-	if !strings.Contains(out.FinalPrompt, promptcompat.BuildOpenAICurrentInputContextPrompt()) {
-		t.Fatalf("expected compacted-context instruction in live prompt, got %s", out.FinalPrompt)
+	if !strings.Contains(out.FinalPrompt, "Continue from the latest state in the attached DS2API_HISTORY.txt context.") {
+		t.Fatalf("expected continuation-oriented prompt in live prompt, got %s", out.FinalPrompt)
 	}
 }

-func TestApplyCurrentInputFileLeavesHistoryTextEmpty(t *testing.T) {
+func TestApplyCurrentInputFileCarriesHistoryText(t *testing.T) {
 	ds := &inlineUploadDSStub{}
 	h := &openAITestSurface{
 		Store: mockOpenAIConfig{
@@ -380,8 +442,11 @@ func TestApplyCurrentInputFileLeavesHistoryTextEmpty(t *testing.T) {
 	if len(ds.uploadCalls) != 1 {
 		t.Fatalf("expected 1 upload call, got %d", len(ds.uploadCalls))
 	}
-	if out.HistoryText != "" {
-		t.Fatalf("expected current input file flow to leave history text empty, got %q", out.HistoryText)
+	if out.HistoryText != string(ds.uploadCalls[0].Data) {
+		t.Fatalf("expected current input file flow to preserve uploaded text in history, got %q", out.HistoryText)
+	}
+	if !strings.Contains(out.HistoryText, "# DS2API_HISTORY.txt") || !strings.Contains(out.HistoryText, "=== 1. SYSTEM ===") {
+		t.Fatalf("expected history text to use numbered transcript format, got %q", out.HistoryText)
 	}
 }

@@ -414,15 +479,18 @@ func TestChatCompletionsCurrentInputFileUploadsContextAndKeepsNeutralPrompt(t *t
 		t.Fatalf("expected 1 upload call, got %d", len(ds.uploadCalls))
 	}
 	upload := ds.uploadCalls[0]
-	if upload.Filename != "IGNORE.txt" {
+	if upload.Filename != "DS2API_HISTORY.txt" {
 		t.Fatalf("unexpected upload filename: %q", upload.Filename)
 	}
 	if upload.Purpose != "assistants" {
 		t.Fatalf("unexpected purpose: %q", upload.Purpose)
 	}
 	historyText := string(upload.Data)
-	if !strings.Contains(historyText, "[file content end]") || !strings.Contains(historyText, "[file name]: IGNORE") {
-		t.Fatalf("expected injected IGNORE wrapper, got %s", historyText)
+	if strings.Contains(historyText, "[file content end]") || strings.Contains(historyText, "[file content begin]") || strings.Contains(historyText, "[file name]:") {
+		t.Fatalf("expected history transcript without file wrapper tags, got %s", historyText)
+	}
+	if !strings.Contains(historyText, "# DS2API_HISTORY.txt") || !strings.Contains(historyText, "=== 1. SYSTEM ===") {
+		t.Fatalf("expected history transcript to use numbered sections, got %s", historyText)
 	}
 	if !strings.Contains(historyText, "latest user turn") {
 		t.Fatalf("expected full context to include latest turn, got %s", historyText)
@@ -431,8 +499,8 @@ func TestChatCompletionsCurrentInputFileUploadsContextAndKeepsNeutralPrompt(t *t
 		t.Fatal("expected completion payload to be captured")
 	}
 	promptText, _ := ds.completionReq["prompt"].(string)
-	if !strings.Contains(promptText, promptcompat.BuildOpenAICurrentInputContextPrompt()) {
-		t.Fatalf("expected compacted-context prompt, got %s", promptText)
+	if !strings.Contains(promptText, "Continue from the latest state in the attached DS2API_HISTORY.txt context.") {
+		t.Fatalf("expected continuation-oriented prompt, got %s", promptText)
 	}
 	if strings.Contains(promptText, "first user turn") || strings.Contains(promptText, "latest user turn") {
 		t.Fatalf("expected prompt to hide original turns, got %s", promptText)
@@ -441,6 +509,16 @@ func TestChatCompletionsCurrentInputFileUploadsContextAndKeepsNeutralPrompt(t *t
 	if len(refIDs) == 0 || refIDs[0] != "file-inline-1" {
 		t.Fatalf("expected uploaded current input file to be first ref_file_id, got %#v", ds.completionReq["ref_file_ids"])
 	}
+	var body map[string]any
+	if err := json.Unmarshal(rec.Body.Bytes(), &body); err != nil {
+		t.Fatalf("decode response failed: %v", err)
+	}
+	usage, _ := body["usage"].(map[string]any)
+	promptTokens := int(usage["prompt_tokens"].(float64))
+	neutralCount := util.CountPromptTokens(promptText, "deepseek-v4-flash")
+	if promptTokens <= neutralCount {
+		t.Fatalf("expected prompt_tokens to exceed neutral live prompt count (includes file context), got=%d neutral=%d", promptTokens, neutralCount)
+	}
 }

 func TestResponsesCurrentInputFileUploadsContextAndKeepsNeutralPrompt(t *testing.T) {
@@ -473,16 +551,30 @@ func TestResponsesCurrentInputFileUploadsContextAndKeepsNeutralPrompt(t *testing
 	if len(ds.uploadCalls) != 1 {
 		t.Fatalf("expected 1 upload call, got %d", len(ds.uploadCalls))
 	}
+	historyText := string(ds.uploadCalls[0].Data)
+	if !strings.Contains(historyText, "# DS2API_HISTORY.txt") || !strings.Contains(historyText, "=== 1. SYSTEM ===") {
+		t.Fatalf("expected uploaded history text to use numbered transcript format, got %s", historyText)
+	}
 	if ds.completionReq == nil {
 		t.Fatal("expected completion payload to be captured")
 	}
 	promptText, _ := ds.completionReq["prompt"].(string)
-	if !strings.Contains(promptText, promptcompat.BuildOpenAICurrentInputContextPrompt()) {
-		t.Fatalf("expected compacted-context prompt, got %s", promptText)
+	if !strings.Contains(promptText, "Continue from the latest state in the attached DS2API_HISTORY.txt context.") {
+		t.Fatalf("expected continuation-oriented prompt, got %s", promptText)
 	}
 	if strings.Contains(promptText, "first user turn") || strings.Contains(promptText, "latest user turn") {
 		t.Fatalf("expected prompt to hide original turns, got %s", promptText)
 	}
+	var body map[string]any
+	if err := json.Unmarshal(rec.Body.Bytes(), &body); err != nil {
+		t.Fatalf("decode response failed: %v", err)
+	}
+	usage, _ := body["usage"].(map[string]any)
+	inputTokens := int(usage["input_tokens"].(float64))
+	neutralCount := util.CountPromptTokens(promptText, "deepseek-v4-flash")
+	if inputTokens <= neutralCount {
+		t.Fatalf("expected input_tokens to exceed neutral live prompt count (includes file context), got=%d neutral=%d", inputTokens, neutralCount)
+	}
 }

 func TestChatCompletionsCurrentInputFileMapsManagedAuthFailureTo401(t *testing.T) {
@@ -609,11 +701,15 @@ func TestCurrentInputFileWorksAcrossAutoDeleteModes(t *testing.T) {
 			if len(ds.uploadCalls) != 1 {
 				t.Fatalf("expected current input upload for mode=%s, got %d", mode, len(ds.uploadCalls))
 			}
+			historyText := string(ds.uploadCalls[0].Data)
+			if !strings.Contains(historyText, "# DS2API_HISTORY.txt") || !strings.Contains(historyText, "=== 1. SYSTEM ===") {
+				t.Fatalf("expected uploaded history text to use numbered transcript format, got %s", historyText)
+			}
 			if ds.completionReq == nil {
 				t.Fatalf("expected completion payload for mode=%s", mode)
 			}
 			promptText, _ := ds.completionReq["prompt"].(string)
-			if !strings.Contains(promptText, promptcompat.BuildOpenAICurrentInputContextPrompt()) || strings.Contains(promptText, "first user turn") || strings.Contains(promptText, "latest user turn") {
+			if !strings.Contains(promptText, "Continue from the latest state in the attached DS2API_HISTORY.txt context.") || strings.Contains(promptText, "first user turn") || strings.Contains(promptText, "latest user turn") {
 				t.Fatalf("unexpected prompt for mode=%s: %s", mode, promptText)
 			}
 		})
--- a/internal/httpapi/openai/leaked_output_sanitize_test.go
+++ b/internal/httpapi/openai/leaked_output_sanitize_test.go
@@ -42,6 +42,14 @@ func TestSanitizeLeakedOutputRemovesDanglingThinkBlock(t *testing.T) {
 	}
 }

+func TestSanitizeLeakedOutputRemovesCompleteDSMLToolCallWrapper(t *testing.T) {
+	raw := "前置文本\n<｜DSML｜tool_calls>\n<｜DSML｜invoke name=\"Bash\">\n<｜DSML｜parameter name=\"command\"></｜DSML｜parameter>\n</｜DSML｜invoke>\n</｜DSML｜tool_calls>\n后置文本"
+	got := sanitizeLeakedOutput(raw)
+	if got != "前置文本\n\n后置文本" {
+		t.Fatalf("unexpected sanitize result for leaked dsml wrapper: %q", got)
+	}
+}
+
 func TestSanitizeLeakedOutputRemovesAgentXMLLeaks(t *testing.T) {
 	raw := "Done.<attempt_completion><result>Some final answer</result></attempt_completion>"
 	got := sanitizeLeakedOutput(raw)
--- a/internal/httpapi/openai/responses/empty_retry_runtime.go
+++ b/internal/httpapi/openai/responses/empty_retry_runtime.go
@@ -18,6 +18,8 @@ import (
 )

 type responsesNonStreamResult struct {
+	rawThinking           string
+	rawText               string
 	thinking              string
 	toolDetectionThinking string
 	text                  string
@@ -27,23 +29,29 @@ type responsesNonStreamResult struct {
 	responseMessageID     int
 }

-func (h *Handler) handleResponsesNonStreamWithRetry(w http.ResponseWriter, ctx context.Context, a *auth.RequestAuth, resp *http.Response, payload map[string]any, pow, owner, responseID, model, finalPrompt string, thinkingEnabled, searchEnabled bool, toolNames []string, toolChoice promptcompat.ToolChoicePolicy, traceID string) {
+func (h *Handler) handleResponsesNonStreamWithRetry(w http.ResponseWriter, ctx context.Context, a *auth.RequestAuth, resp *http.Response, payload map[string]any, pow, owner, responseID, model, finalPrompt string, refFileTokens int, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, toolChoice promptcompat.ToolChoicePolicy, traceID string) {
 	attempts := 0
 	currentResp := resp
 	usagePrompt := finalPrompt
 	accumulatedThinking := ""
+	accumulatedRawThinking := ""
 	accumulatedToolDetectionThinking := ""
 	for {
-		result, ok := h.collectResponsesNonStreamAttempt(w, currentResp, responseID, model, usagePrompt, thinkingEnabled, searchEnabled, toolNames)
+		result, ok := h.collectResponsesNonStreamAttempt(w, currentResp, responseID, model, usagePrompt, thinkingEnabled, searchEnabled, toolNames, toolsRaw)
 		if !ok {
 			return
 		}
 		accumulatedThinking += sse.TrimContinuationOverlap(accumulatedThinking, result.thinking)
+		accumulatedRawThinking += sse.TrimContinuationOverlap(accumulatedRawThinking, result.rawThinking)
 		accumulatedToolDetectionThinking += sse.TrimContinuationOverlap(accumulatedToolDetectionThinking, result.toolDetectionThinking)
 		result.thinking = accumulatedThinking
+		result.rawThinking = accumulatedRawThinking
 		result.toolDetectionThinking = accumulatedToolDetectionThinking
-		result.parsed = detectAssistantToolCalls(result.text, result.thinking, result.toolDetectionThinking, toolNames)
-		result.body = openaifmt.BuildResponseObjectWithToolCalls(responseID, model, usagePrompt, result.thinking, result.text, result.parsed.Calls)
+		result.parsed = detectAssistantToolCalls(result.rawText, result.text, result.rawThinking, result.toolDetectionThinking, toolNames)
+		result.body = openaifmt.BuildResponseObjectWithToolCalls(responseID, model, usagePrompt, result.thinking, result.text, result.parsed.Calls, toolsRaw)
+		if refFileTokens > 0 {
+			addRefFileTokensToUsage(result.body, refFileTokens)
+		}

 		if !shouldRetryResponsesNonStream(result, attempts) {
 			h.finishResponsesNonStreamResult(w, result, attempts, owner, responseID, toolChoice, traceID)
@@ -63,12 +71,12 @@ func (h *Handler) handleResponsesNonStreamWithRetry(w http.ResponseWriter, ctx c
 			config.Logger.Warn("[openai_empty_retry] retry request failed", "surface", "responses", "stream", false, "retry_attempt", attempts, "error", err)
 			return
 		}
-		usagePrompt = usagePromptWithEmptyOutputRetry(finalPrompt, attempts)
+		usagePrompt = usagePromptWithEmptyOutputRetry(usagePrompt, attempts)
 		currentResp = nextResp
 	}
 }

-func (h *Handler) collectResponsesNonStreamAttempt(w http.ResponseWriter, resp *http.Response, responseID, model, usagePrompt string, thinkingEnabled, searchEnabled bool, toolNames []string) (responsesNonStreamResult, bool) {
+func (h *Handler) collectResponsesNonStreamAttempt(w http.ResponseWriter, resp *http.Response, responseID, model, usagePrompt string, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any) (responsesNonStreamResult, bool) {
 	defer func() { _ = resp.Body.Close() }()
 	if resp.StatusCode != http.StatusOK {
 		body, _ := io.ReadAll(resp.Body)
@@ -78,16 +86,17 @@ func (h *Handler) collectResponsesNonStreamAttempt(w http.ResponseWriter, resp *
 	result := sse.CollectStream(resp, thinkingEnabled, false)
 	stripReferenceMarkers := h.compatStripReferenceMarkers()
 	sanitizedThinking := cleanVisibleOutput(result.Thinking, stripReferenceMarkers)
-	toolDetectionThinking := cleanVisibleOutput(result.ToolDetectionThinking, stripReferenceMarkers)
 	sanitizedText := cleanVisibleOutput(result.Text, stripReferenceMarkers)
 	if searchEnabled {
 		sanitizedText = replaceCitationMarkersWithLinks(sanitizedText, result.CitationLinks)
 	}
-	textParsed := detectAssistantToolCalls(sanitizedText, sanitizedThinking, toolDetectionThinking, toolNames)
-	responseObj := openaifmt.BuildResponseObjectWithToolCalls(responseID, model, usagePrompt, sanitizedThinking, sanitizedText, textParsed.Calls)
+	textParsed := detectAssistantToolCalls(result.Text, sanitizedText, result.Thinking, result.ToolDetectionThinking, toolNames)
+	responseObj := openaifmt.BuildResponseObjectWithToolCalls(responseID, model, usagePrompt, sanitizedThinking, sanitizedText, textParsed.Calls, toolsRaw)
 	return responsesNonStreamResult{
+		rawThinking:           result.Thinking,
+		rawText:               result.Text,
 		thinking:              sanitizedThinking,
-		toolDetectionThinking: toolDetectionThinking,
+		toolDetectionThinking: result.ToolDetectionThinking,
 		text:                  sanitizedText,
 		contentFilter:         result.ContentFilter,
 		parsed:                textParsed,
@@ -123,8 +132,8 @@ func shouldRetryResponsesNonStream(result responsesNonStreamResult, attempts int
 		strings.TrimSpace(result.text) == ""
 }

-func (h *Handler) handleResponsesStreamWithRetry(w http.ResponseWriter, r *http.Request, a *auth.RequestAuth, resp *http.Response, payload map[string]any, pow, owner, responseID, model, finalPrompt string, thinkingEnabled, searchEnabled bool, toolNames []string, toolChoice promptcompat.ToolChoicePolicy, traceID string) {
-	streamRuntime, initialType, ok := h.prepareResponsesStreamRuntime(w, resp, owner, responseID, model, finalPrompt, thinkingEnabled, searchEnabled, toolNames, toolChoice, traceID)
+func (h *Handler) handleResponsesStreamWithRetry(w http.ResponseWriter, r *http.Request, a *auth.RequestAuth, resp *http.Response, payload map[string]any, pow, owner, responseID, model, finalPrompt string, refFileTokens int, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, toolChoice promptcompat.ToolChoicePolicy, traceID string) {
+	streamRuntime, initialType, ok := h.prepareResponsesStreamRuntime(w, resp, owner, responseID, model, finalPrompt, refFileTokens, thinkingEnabled, searchEnabled, toolNames, toolsRaw, toolChoice, traceID)
 	if !ok {
 		return
 	}
@@ -165,7 +174,7 @@ func (h *Handler) handleResponsesStreamWithRetry(w http.ResponseWriter, r *http.
 	}
 }

-func (h *Handler) prepareResponsesStreamRuntime(w http.ResponseWriter, resp *http.Response, owner, responseID, model, finalPrompt string, thinkingEnabled, searchEnabled bool, toolNames []string, toolChoice promptcompat.ToolChoicePolicy, traceID string) (*responsesStreamRuntime, string, bool) {
+func (h *Handler) prepareResponsesStreamRuntime(w http.ResponseWriter, resp *http.Response, owner, responseID, model, finalPrompt string, refFileTokens int, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, toolChoice promptcompat.ToolChoicePolicy, traceID string) (*responsesStreamRuntime, string, bool) {
 	if resp.StatusCode != http.StatusOK {
 		defer func() { _ = resp.Body.Close() }()
 		body, _ := io.ReadAll(resp.Body)
@@ -184,12 +193,13 @@ func (h *Handler) prepareResponsesStreamRuntime(w http.ResponseWriter, resp *htt
 	}
 	streamRuntime := newResponsesStreamRuntime(
 		w, rc, canFlush, responseID, model, finalPrompt, thinkingEnabled, searchEnabled,
-		h.compatStripReferenceMarkers(), toolNames, len(toolNames) > 0,
+		h.compatStripReferenceMarkers(), toolNames, toolsRaw, len(toolNames) > 0,
 		h.toolcallFeatureMatchEnabled() && h.toolcallEarlyEmitHighConfidence(),
 		toolChoice, traceID, func(obj map[string]any) {
 			h.getResponseStore().put(owner, responseID, obj)
 		},
 	)
+	streamRuntime.refFileTokens = refFileTokens
 	streamRuntime.sendCreated()
 	return streamRuntime, initialType, true
 }
@@ -212,7 +222,13 @@ func (h *Handler) consumeResponsesStreamAttempt(r *http.Request, resp *http.Resp
 				finalReason = "content_filter"
 			}
 		},
+		OnContextDone: func() {
+			streamRuntime.markContextCancelled()
+		},
 	})
+	if streamRuntime.finalErrorCode == string(streamengine.StopReasonContextCancelled) {
+		return true, false
+	}
 	terminalWritten := streamRuntime.finalize(finalReason, allowDeferEmpty && finalReason != "content_filter")
 	if terminalWritten {
 		return true, false
@@ -225,6 +241,10 @@ func logResponsesStreamTerminal(streamRuntime *responsesStreamRuntime, attempts
 	if attempts > 0 {
 		source = "synthetic_retry"
 	}
+	if streamRuntime.finalErrorCode == string(streamengine.StopReasonContextCancelled) {
+		config.Logger.Info("[openai_empty_retry] terminal cancelled", "surface", "responses", "stream", true, "retry_attempts", attempts, "error_code", streamRuntime.finalErrorCode)
+		return
+	}
 	if streamRuntime.failed {
 		config.Logger.Info("[openai_empty_retry] terminal empty output", "surface", "responses", "stream", true, "retry_attempts", attempts, "success_source", "none", "error_code", streamRuntime.finalErrorCode)
 		return
--- a/internal/httpapi/openai/responses/empty_retry_runtime_test.go
+++ b/internal/httpapi/openai/responses/empty_retry_runtime_test.go
@@ -0,0 +1,70 @@
+package responses
+
+import (
+	"context"
+	"io"
+	"net/http"
+	"net/http/httptest"
+	"strings"
+	"testing"
+
+	"ds2api/internal/promptcompat"
+	"ds2api/internal/stream"
+)
+
+func makeResponsesOpenAISSEHTTPResponse(lines ...string) *http.Response {
+	body := strings.Join(lines, "\n")
+	if !strings.HasSuffix(body, "\n") {
+		body += "\n"
+	}
+	return &http.Response{
+		StatusCode: http.StatusOK,
+		Header:     make(http.Header),
+		Body:       io.NopCloser(strings.NewReader(body)),
+	}
+}
+
+func TestConsumeResponsesStreamAttemptMarksContextCancelledState(t *testing.T) {
+	ctx, cancel := context.WithCancel(context.Background())
+	cancel()
+
+	req := httptest.NewRequest(http.MethodPost, "/v1/responses", nil).WithContext(ctx)
+	rec := httptest.NewRecorder()
+	streamRuntime := newResponsesStreamRuntime(
+		rec,
+		http.NewResponseController(rec),
+		true,
+		"resp-cancelled",
+		"deepseek-v4-flash",
+		"prompt",
+		false,
+		false,
+		true,
+		nil,
+		nil,
+		false,
+		false,
+		promptcompat.DefaultToolChoicePolicy(),
+		"",
+		nil,
+	)
+	resp := makeResponsesOpenAISSEHTTPResponse(
+		`data: {"p":"response/content","v":"hello"}`,
+		`data: [DONE]`,
+	)
+
+	h := &Handler{}
+	terminalWritten, retryable := h.consumeResponsesStreamAttempt(req, resp, streamRuntime, "text", false, true)
+	if !terminalWritten || retryable {
+		t.Fatalf("expected cancelled attempt to terminate without retry, got terminalWritten=%v retryable=%v", terminalWritten, retryable)
+	}
+	if !streamRuntime.failed {
+		t.Fatal("expected cancelled response stream to be marked failed")
+	}
+	if got, want := streamRuntime.finalErrorCode, string(stream.StopReasonContextCancelled); got != want {
+		t.Fatalf("expected cancelled final error code %q, got %q", want, got)
+	}
+	if streamRuntime.finalErrorMessage == "" {
+		t.Fatal("expected cancelled final error message to be preserved")
+	}
+}
--- a/internal/httpapi/openai/responses/handler.go
+++ b/internal/httpapi/openai/responses/handler.go
@@ -130,6 +130,6 @@ func filterIncrementalToolCallDeltasByAllowed(deltas []toolstream.ToolCallDelta,
 	return shared.FilterIncrementalToolCallDeltasByAllowed(deltas, seenNames)
 }

-func detectAssistantToolCalls(text, exposedThinking, detectionThinking string, toolNames []string) toolcall.ToolCallParseResult {
-	return shared.DetectAssistantToolCalls(text, exposedThinking, detectionThinking, toolNames)
+func detectAssistantToolCalls(rawText, visibleText, exposedThinking, detectionThinking string, toolNames []string) toolcall.ToolCallParseResult {
+	return shared.DetectAssistantToolCalls(rawText, visibleText, exposedThinking, detectionThinking, toolNames)
 }
--- a/internal/httpapi/openai/responses/ref_file_tokens.go
+++ b/internal/httpapi/openai/responses/ref_file_tokens.go
@@ -0,0 +1,26 @@
+package responses
+
+// addRefFileTokensToUsage adds the token estimate for inline-uploaded files to the
+// usage map of a response object, covering file content that the upstream model
+// processes outside the prompt text. Only int-valued counters are adjusted.
+func addRefFileTokensToUsage(obj map[string]any, refFileTokens int) {
+	if refFileTokens <= 0 || obj == nil {
+		return
+	}
+	usage, ok := obj["usage"].(map[string]any)
+	if !ok || usage == nil {
+		return
+	}
+	for _, key := range []string{"input_tokens", "prompt_tokens"} {
+		if v, ok := usage[key]; ok {
+			if n, ok := v.(int); ok {
+				usage[key] = n + refFileTokens
+			}
+		}
+	}
+	if v, ok := usage["total_tokens"]; ok {
+		if n, ok := v.(int); ok {
+			usage["total_tokens"] = n + refFileTokens
+		}
+	}
+}
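
Reviewer note: the `v.(int)` assertions in `addRefFileTokensToUsage` mean the adjustment applies only while the usage counters are still Go `int` values. A minimal standalone sketch of that behaviour follows. The helper name `addRefFileTokens` and the sample maps are illustrative assumptions, not code from this change.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// addRefFileTokens is a hypothetical, simplified copy of the type-asserting
// update in the diff: only int-valued counters are bumped.
func addRefFileTokens(usage map[string]any, extra int) {
	if extra <= 0 || usage == nil {
		return
	}
	for _, key := range []string{"input_tokens", "prompt_tokens", "total_tokens"} {
		if n, ok := usage[key].(int); ok {
			usage[key] = n + extra
		}
	}
}

func main() {
	// Built in memory: counters are Go ints, so the update applies.
	inMemory := map[string]any{"input_tokens": 10, "output_tokens": 5, "total_tokens": 15}
	addRefFileTokens(inMemory, 7)
	fmt.Println(inMemory["input_tokens"], inMemory["total_tokens"])

	// Decoded from JSON: numbers arrive as float64, so the int assertion
	// fails and the counters are left untouched.
	var decoded map[string]any
	if err := json.Unmarshal([]byte(`{"input_tokens":10,"total_tokens":15}`), &decoded); err != nil {
		panic(err)
	}
	addRefFileTokens(decoded, 7)
	fmt.Println(decoded["input_tokens"], decoded["total_tokens"])
}
```

If a usage map ever reaches this helper after a JSON round trip, the file tokens would be dropped silently, so it may be worth confirming that every caller passes a freshly built object.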
--- a/internal/httpapi/openai/responses/responses_handler.go
+++ b/internal/httpapi/openai/responses/responses_handler.go
@@ -114,14 +114,15 @@ func (h *Handler) Responses(w http.ResponseWriter, r *http.Request) {
 	}

 	responseID := "resp_" + strings.ReplaceAll(uuid.NewString(), "-", "")
+	refFileTokens := stdReq.RefFileTokens
 	if stdReq.Stream {
-		h.handleResponsesStreamWithRetry(w, r, a, resp, payload, pow, owner, responseID, stdReq.ResponseModel, stdReq.FinalPrompt, stdReq.Thinking, stdReq.Search, stdReq.ToolNames, stdReq.ToolChoice, traceID)
+		h.handleResponsesStreamWithRetry(w, r, a, resp, payload, pow, owner, responseID, stdReq.ResponseModel, stdReq.PromptTokenText, refFileTokens, stdReq.Thinking, stdReq.Search, stdReq.ToolNames, stdReq.ToolsRaw, stdReq.ToolChoice, traceID)
 		return
 	}
-	h.handleResponsesNonStreamWithRetry(w, r.Context(), a, resp, payload, pow, owner, responseID, stdReq.ResponseModel, stdReq.FinalPrompt, stdReq.Thinking, stdReq.Search, stdReq.ToolNames, stdReq.ToolChoice, traceID)
+	h.handleResponsesNonStreamWithRetry(w, r.Context(), a, resp, payload, pow, owner, responseID, stdReq.ResponseModel, stdReq.PromptTokenText, refFileTokens, stdReq.Thinking, stdReq.Search, stdReq.ToolNames, stdReq.ToolsRaw, stdReq.ToolChoice, traceID)
 }

-func (h *Handler) handleResponsesNonStream(w http.ResponseWriter, resp *http.Response, owner, responseID, model, finalPrompt string, thinkingEnabled, searchEnabled bool, toolNames []string, toolChoice promptcompat.ToolChoicePolicy, traceID string) {
+func (h *Handler) handleResponsesNonStream(w http.ResponseWriter, resp *http.Response, owner, responseID, model, finalPrompt string, refFileTokens int, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, toolChoice promptcompat.ToolChoicePolicy, traceID string) {
 	defer func() { _ = resp.Body.Close() }()
 	if resp.StatusCode != http.StatusOK {
 		body, _ := io.ReadAll(resp.Body)
@@ -131,12 +132,11 @@ func (h *Handler) handleResponsesNonStream(w http.ResponseWriter, resp *http.Res
 	result := sse.CollectStream(resp, thinkingEnabled, true)
 	stripReferenceMarkers := h.compatStripReferenceMarkers()
 	sanitizedThinking := cleanVisibleOutput(result.Thinking, stripReferenceMarkers)
-	toolDetectionThinking := cleanVisibleOutput(result.ToolDetectionThinking, stripReferenceMarkers)
 	sanitizedText := cleanVisibleOutput(result.Text, stripReferenceMarkers)
 	if searchEnabled {
 		sanitizedText = replaceCitationMarkersWithLinks(sanitizedText, result.CitationLinks)
 	}
-	textParsed := detectAssistantToolCalls(sanitizedText, sanitizedThinking, toolDetectionThinking, toolNames)
+	textParsed := detectAssistantToolCalls(result.Text, sanitizedText, result.Thinking, result.ToolDetectionThinking, toolNames)
 	if len(textParsed.Calls) == 0 && writeUpstreamEmptyOutputError(w, sanitizedText, sanitizedThinking, result.ContentFilter) {
 		return
 	}
@@ -148,12 +148,15 @@ func (h *Handler) handleResponsesNonStream(w http.ResponseWriter, resp *http.Res
 		return
 	}

-	responseObj := openaifmt.BuildResponseObjectWithToolCalls(responseID, model, finalPrompt, sanitizedThinking, sanitizedText, textParsed.Calls)
+	responseObj := openaifmt.BuildResponseObjectWithToolCalls(responseID, model, finalPrompt, sanitizedThinking, sanitizedText, textParsed.Calls, toolsRaw)
+	if refFileTokens > 0 {
+		addRefFileTokensToUsage(responseObj, refFileTokens)
+	}
 	h.getResponseStore().put(owner, responseID, responseObj)
 	writeJSON(w, http.StatusOK, responseObj)
 }

-func (h *Handler) handleResponsesStream(w http.ResponseWriter, r *http.Request, resp *http.Response, owner, responseID, model, finalPrompt string, thinkingEnabled, searchEnabled bool, toolNames []string, toolChoice promptcompat.ToolChoicePolicy, traceID string) {
+func (h *Handler) handleResponsesStream(w http.ResponseWriter, r *http.Request, resp *http.Response, owner, responseID, model, finalPrompt string, refFileTokens int, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, toolChoice promptcompat.ToolChoicePolicy, traceID string) {
 	defer func() { _ = resp.Body.Close() }()
 	if resp.StatusCode != http.StatusOK {
 		body, _ := io.ReadAll(resp.Body)
@@ -186,6 +189,7 @@ func (h *Handler) handleResponsesStream(w http.ResponseWriter, r *http.Request,
 		searchEnabled,
 		stripReferenceMarkers,
 		toolNames,
+		toolsRaw,
 		bufferToolContent,
 		emitEarlyToolDeltas,
 		toolChoice,
@@ -194,6 +198,7 @@ func (h *Handler) handleResponsesStream(w http.ResponseWriter, r *http.Request,
 			h.getResponseStore().put(owner, responseID, obj)
 		},
 	)
+	streamRuntime.refFileTokens = refFileTokens
 	streamRuntime.sendCreated()

 	streamengine.ConsumeSSE(streamengine.ConsumeConfig{
--- a/internal/httpapi/openai/responses/responses_stream_delta_batch.go
+++ b/internal/httpapi/openai/responses/responses_stream_delta_batch.go
@@ -0,0 +1,39 @@
+package responses
+
+import (
+	"strings"
+
+	openaifmt "ds2api/internal/format/openai"
+)
+
+type responsesDeltaBatch struct {
+	runtime *responsesStreamRuntime
+	kind    string
+	text    strings.Builder
+}
+
+func (b *responsesDeltaBatch) append(kind, text string) {
+	if text == "" {
+		return
+	}
+	if b.kind != "" && b.kind != kind {
+		b.flush()
+	}
+	b.kind = kind
+	b.text.WriteString(text)
+}
+
+func (b *responsesDeltaBatch) flush() {
+	if b.kind == "" || b.text.Len() == 0 {
+		return
+	}
+	text := b.text.String()
+	switch b.kind {
+	case "reasoning":
+		b.runtime.sendEvent("response.reasoning.delta", openaifmt.BuildResponsesReasoningDeltaPayload(b.runtime.responseID, text))
+	case "text":
+		b.runtime.emitTextDelta(text)
+	}
+	b.kind = ""
+	b.text.Reset()
+}
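
Reviewer note: the append/flush rules of `responsesDeltaBatch` can be exercised in isolation. The sketch below swaps the runtime's event writer for a plain slice. The `emitted` type and the `coalesce` helper are assumptions made for illustration. In the diff itself a batch is created per `onParsed` call and flushed before tool-stream processing.

```go
package main

import (
	"fmt"
	"strings"
)

// emitted records one outgoing event; a stand-in for the SSE writer.
type emitted struct {
	kind string
	text string
}

// deltaBatch buffers consecutive deltas of one kind and emits them as one event.
type deltaBatch struct {
	kind string
	text strings.Builder
	out  *[]emitted
}

func (b *deltaBatch) append(kind, text string) {
	if text == "" {
		return // empty deltas never start or extend a batch
	}
	if b.kind != "" && b.kind != kind {
		b.flush() // a kind switch closes the pending batch first
	}
	b.kind = kind
	b.text.WriteString(text)
}

func (b *deltaBatch) flush() {
	if b.kind == "" || b.text.Len() == 0 {
		return
	}
	*b.out = append(*b.out, emitted{kind: b.kind, text: b.text.String()})
	b.kind = ""
	b.text.Reset()
}

// coalesce runs a sequence of (kind, text) parts through a single batch.
func coalesce(parts [][2]string) []emitted {
	var out []emitted
	b := deltaBatch{out: &out}
	for _, p := range parts {
		b.append(p[0], p[1])
	}
	b.flush()
	return out
}

func main() {
	events := coalesce([][2]string{
		{"reasoning", "a"}, {"reasoning", "b"},
		{"text", "x"}, {"text", ""}, {"text", "y"},
	})
	for _, e := range events {
		fmt.Printf("%s: %q\n", e.kind, e.text)
	}
}
```

Five parts collapse into two events here, one per run of the same kind, and ordering between reasoning and text is preserved because a kind switch always flushes first.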
--- a/internal/httpapi/openai/responses/responses_stream_runtime_core.go
+++ b/internal/httpapi/openai/responses/responses_stream_runtime_core.go
@@ -18,12 +18,14 @@ type responsesStreamRuntime struct {
 	rc       *http.ResponseController
 	canFlush bool

-	responseID  string
-	model       string
-	finalPrompt string
-	toolNames   []string
-	traceID     string
-	toolChoice  promptcompat.ToolChoicePolicy
+	responseID    string
+	model         string
+	finalPrompt   string
+	refFileTokens int
+	toolNames     []string
+	toolsRaw      any
+	traceID       string
+	toolChoice    promptcompat.ToolChoicePolicy

 	thinkingEnabled       bool
 	searchEnabled         bool
@@ -35,8 +37,10 @@ type responsesStreamRuntime struct {
 	toolCallsDoneEmitted bool

 	sieve                 toolstream.State
+	rawThinking           strings.Builder
 	thinking              strings.Builder
 	toolDetectionThinking strings.Builder
+	rawText               strings.Builder
 	text                  strings.Builder
 	visibleText           strings.Builder
 	responseMessageID     int
@@ -72,6 +76,7 @@ func newResponsesStreamRuntime(
 	searchEnabled bool,
 	stripReferenceMarkers bool,
 	toolNames []string,
+	toolsRaw any,
 	bufferToolContent bool,
 	emitEarlyToolDeltas bool,
 	toolChoice promptcompat.ToolChoicePolicy,
@@ -89,6 +94,7 @@ func newResponsesStreamRuntime(
 		searchEnabled:         searchEnabled,
 		stripReferenceMarkers: stripReferenceMarkers,
 		toolNames:             toolNames,
+		toolsRaw:              toolsRaw,
 		bufferToolContent:     bufferToolContent,
 		emitEarlyToolDeltas:   emitEarlyToolDeltas,
 		streamToolCallIDs:     map[int]string{},
@@ -133,20 +139,26 @@ func (s *responsesStreamRuntime) failResponse(status int, message, code string)
 	s.sendDone()
 }

+func (s *responsesStreamRuntime) markContextCancelled() {
+	s.failed = true
+	s.finalErrorStatus = 499
+	s.finalErrorMessage = "request context cancelled"
+	s.finalErrorCode = string(streamengine.StopReasonContextCancelled)
+}
+
 func (s *responsesStreamRuntime) finalize(finishReason string, deferEmptyOutput bool) bool {
 	s.failed = false
 	s.finalErrorStatus = 0
 	s.finalErrorMessage = ""
 	s.finalErrorCode = ""
-	finalThinking := s.thinking.String()
-	finalToolDetectionThinking := s.toolDetectionThinking.String()
-	finalText := cleanVisibleOutput(s.text.String(), s.stripReferenceMarkers)
-
 	if s.bufferToolContent {
 		s.processToolStreamEvents(toolstream.Flush(&s.sieve, s.toolNames), true, true)
 	}

-	textParsed := detectAssistantToolCalls(finalText, finalThinking, finalToolDetectionThinking, s.toolNames)
+	finalThinking := s.thinking.String()
+	finalToolDetectionThinking := s.toolDetectionThinking.String()
+	finalText := cleanVisibleOutput(s.text.String(), s.stripReferenceMarkers)
+	textParsed := detectAssistantToolCalls(s.rawText.String(), finalText, s.rawThinking.String(), finalToolDetectionThinking, s.toolNames)
 	detected := textParsed.Calls
 	s.logToolPolicyRejections(textParsed)

@@ -217,6 +229,7 @@ func (s *responsesStreamRuntime) onParsed(parsed sse.LineResult) streamengine.Pa
 	}

 	contentSeen := false
+	batch := responsesDeltaBatch{runtime: s}
 	for _, p := range parsed.ToolDetectionThinkingParts {
 		trimmed := sse.TrimContinuationOverlap(s.toolDetectionThinking.String(), p.Text)
 		if trimmed != "" {
@@ -224,38 +237,53 @@ func (s *responsesStreamRuntime) onParsed(parsed sse.LineResult) streamengine.Pa
 		}
 	}
 	for _, p := range parsed.Parts {
-		cleanedText := cleanVisibleOutput(p.Text, s.stripReferenceMarkers)
-		if cleanedText == "" {
-			continue
-		}
-		if p.Type != "thinking" && s.searchEnabled && sse.IsCitation(cleanedText) {
-			continue
-		}
-		contentSeen = true
 		if p.Type == "thinking" {
+			rawTrimmed := sse.TrimContinuationOverlap(s.rawThinking.String(), p.Text)
+			if rawTrimmed != "" {
+				s.rawThinking.WriteString(rawTrimmed)
+				contentSeen = true
+			}
 			if !s.thinkingEnabled {
 				continue
 			}
+			cleanedText := cleanVisibleOutput(rawTrimmed, s.stripReferenceMarkers)
+			if cleanedText == "" {
+				continue
+			}
 			trimmed := sse.TrimContinuationOverlap(s.thinking.String(), cleanedText)
 			if trimmed == "" {
 				continue
 			}
 			s.thinking.WriteString(trimmed)
-			s.sendEvent("response.reasoning.delta", openaifmt.BuildResponsesReasoningDeltaPayload(s.responseID, trimmed))
+			batch.append("reasoning", trimmed)
 			continue
 		}

+		rawTrimmed := sse.TrimContinuationOverlap(s.rawText.String(), p.Text)
+		if rawTrimmed == "" {
+			continue
+		}
+		s.rawText.WriteString(rawTrimmed)
+		contentSeen = true
+		cleanedText := cleanVisibleOutput(rawTrimmed, s.stripReferenceMarkers)
+		if s.searchEnabled && sse.IsCitation(cleanedText) {
+			continue
+		}
 		trimmed := sse.TrimContinuationOverlap(s.text.String(), cleanedText)
-		if trimmed == "" {
-			continue
+		if trimmed != "" {
+			s.text.WriteString(trimmed)
 		}
-		s.text.WriteString(trimmed)
 		if !s.bufferToolContent {
-			s.emitTextDelta(trimmed)
+			if trimmed == "" {
+				continue
+			}
+			batch.append("text", trimmed)
 			continue
 		}
-		s.processToolStreamEvents(toolstream.ProcessChunk(&s.sieve, trimmed, s.toolNames), true, true)
+		batch.flush()
+		s.processToolStreamEvents(toolstream.ProcessChunk(&s.sieve, rawTrimmed, s.toolNames), true, true)
 	}

+	batch.flush()
 	return streamengine.ParsedDecision{ContentSeen: contentSeen}
 }
--- a/internal/httpapi/openai/responses/responses_stream_runtime_events.go
+++ b/internal/httpapi/openai/responses/responses_stream_runtime_events.go
@@ -4,6 +4,7 @@ import (
 	"encoding/json"

 	openaifmt "ds2api/internal/format/openai"
+	"ds2api/internal/sse"
 	"ds2api/internal/toolstream"
 )

@@ -43,7 +44,10 @@ func (s *responsesStreamRuntime) sendDone() {
 func (s *responsesStreamRuntime) processToolStreamEvents(events []toolstream.Event, emitContent bool, resetAfterToolCalls bool) {
 	for _, evt := range events {
 		if emitContent && evt.Content != "" {
-			s.emitTextDelta(evt.Content)
+			cleaned := cleanVisibleOutput(evt.Content, s.stripReferenceMarkers)
+			if cleaned != "" && (!s.searchEnabled || !sse.IsCitation(cleaned)) {
+				s.emitTextDelta(cleaned)
+			}
 		}
 		if len(evt.ToolCallDeltas) > 0 {
 			if !s.emitEarlyToolDeltas {
--- a/internal/httpapi/openai/responses/responses_stream_runtime_toolcalls.go
+++ b/internal/httpapi/openai/responses/responses_stream_runtime_toolcalls.go
@@ -220,7 +220,8 @@ func (s *responsesStreamRuntime) emitFunctionCallDeltaEvents(deltas []toolstream
 }

 func (s *responsesStreamRuntime) emitFunctionCallDoneEvents(calls []toolcall.ParsedToolCall) {
-	for idx, tc := range calls {
+	normalizedCalls := toolcall.NormalizeParsedToolCallsForSchemas(calls, s.toolsRaw)
+	for idx, tc := range normalizedCalls {
 		if strings.TrimSpace(tc.Name) == "" {
 			continue
 		}
--- a/internal/httpapi/openai/responses/responses_stream_runtime_toolcalls_finalize.go
+++ b/internal/httpapi/openai/responses/responses_stream_runtime_toolcalls_finalize.go
@@ -109,7 +109,8 @@ func (s *responsesStreamRuntime) buildCompletedResponseObject(finalThinking, fin
 		}
 	}

-	for idx, tc := range calls {
+	normalizedCalls := toolcall.NormalizeParsedToolCallsForSchemas(calls, s.toolsRaw)
+	for idx, tc := range normalizedCalls {
 		if strings.TrimSpace(tc.Name) == "" {
 			continue
 		}
@@ -144,7 +145,7 @@ func (s *responsesStreamRuntime) buildCompletedResponseObject(finalThinking, fin
 		}
 	}

-	return openaifmt.BuildResponseObjectFromItems(
+	obj := openaifmt.BuildResponseObjectFromItems(
 		s.responseID,
 		s.model,
 		s.finalPrompt,
@@ -153,4 +154,8 @@ func (s *responsesStreamRuntime) buildCompletedResponseObject(finalThinking, fin
 		output,
 		outputText,
 	)
+	if s.refFileTokens > 0 {
+		addRefFileTokensToUsage(obj, s.refFileTokens)
+	}
+	return obj
 }
--- a/internal/httpapi/openai/responses/responses_stream_test.go
+++ b/internal/httpapi/openai/responses/responses_stream_test.go
@@ -27,7 +27,7 @@ func TestHandleResponsesStreamDoesNotEmitReasoningTextCompatEvents(t *testing.T)
 		Body:       io.NopCloser(strings.NewReader(streamBody)),
 	}

-	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_test", "deepseek-v4-pro", "prompt", true, false, nil, promptcompat.DefaultToolChoicePolicy(), "")
+	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_test", "deepseek-v4-pro", "prompt", 0, true, false, nil, nil, promptcompat.DefaultToolChoicePolicy(), "")

 	body := rec.Body.String()
 	if !strings.Contains(body, "event: response.reasoning.delta") {
@@ -57,7 +57,7 @@ func TestHandleResponsesStreamEmitsOutputTextDoneBeforeContentPartDone(t *testin
 		Body:       io.NopCloser(strings.NewReader(streamBody)),
 	}

-	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", false, false, nil, promptcompat.DefaultToolChoicePolicy(), "")
+	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", 0, false, false, nil, nil, promptcompat.DefaultToolChoicePolicy(), "")
 	body := rec.Body.String()
 	if !strings.Contains(body, "event: response.output_text.done") {
 		t.Fatalf("expected response.output_text.done payload, body=%s", body)
@@ -91,7 +91,7 @@ func TestHandleResponsesStreamOutputTextDeltaCarriesItemIndexes(t *testing.T) {
 		Body:       io.NopCloser(strings.NewReader(streamBody)),
 	}

-	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", false, false, nil, promptcompat.DefaultToolChoicePolicy(), "")
+	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", 0, false, false, nil, nil, promptcompat.DefaultToolChoicePolicy(), "")
 	body := rec.Body.String()

 	deltaPayload, ok := extractSSEEventPayload(body, "response.output_text.delta")
@@ -109,6 +109,48 @@ func TestHandleResponsesStreamOutputTextDeltaCarriesItemIndexes(t *testing.T) {
 	}
 }

+func TestHandleResponsesStreamCoalescesSmallOutputTextDeltas(t *testing.T) {
+	h := &Handler{}
+	req := httptest.NewRequest(http.MethodPost, "/v1/responses", nil)
+	rec := httptest.NewRecorder()
+
+	var streamBody strings.Builder
+	for i := 0; i < 100; i++ {
+		b, _ := json.Marshal(map[string]any{
+			"p": "response/content",
+			"v": "字",
+		})
+		streamBody.WriteString("data: ")
+		streamBody.WriteString(string(b))
+		streamBody.WriteString("\n")
+	}
+	streamBody.WriteString("data: [DONE]\n")
+	resp := &http.Response{
+		StatusCode: http.StatusOK,
+		Body:       io.NopCloser(strings.NewReader(streamBody.String())),
+	}
+
+	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_coalesce", "deepseek-v4-flash", "prompt", 0, false, false, nil, nil, promptcompat.DefaultToolChoicePolicy(), "")
+
+	payloads := extractSSEEventPayloads(rec.Body.String(), "response.output_text.delta")
+	if len(payloads) == 0 {
+		t.Fatalf("expected response.output_text.delta payloads, body=%s", rec.Body.String())
+	}
+	var content strings.Builder
+	for _, payload := range payloads {
+		content.WriteString(asString(payload["delta"]))
+	}
+	if got, want := content.String(), strings.Repeat("字", 100); got != want {
+		t.Fatalf("coalesced response content mismatch: got %q want %q body=%s", got, want, rec.Body.String())
+	}
+	if len(payloads) >= 100 {
+		t.Fatalf("expected coalescing to reduce 100 tiny text deltas, got %d body=%s", len(payloads), rec.Body.String())
+	}
+	if !strings.Contains(rec.Body.String(), "event: response.completed") {
+		t.Fatalf("expected completed event, body=%s", rec.Body.String())
+	}
+}
+
 func TestHandleResponsesStreamEmitsDistinctToolCallIDsAcrossSeparateToolBlocks(t *testing.T) {
 	h := &Handler{}
 	req := httptest.NewRequest(http.MethodPost, "/v1/responses", nil)
@@ -130,7 +172,7 @@ func TestHandleResponsesStreamEmitsDistinctToolCallIDsAcrossSeparateToolBlocks(t
 		Body:       io.NopCloser(strings.NewReader(streamBody)),
 	}

-	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", false, false, []string{"read_file", "search"}, promptcompat.DefaultToolChoicePolicy(), "")
+	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", 0, false, false, []string{"read_file", "search"}, nil, promptcompat.DefaultToolChoicePolicy(), "")

 	body := rec.Body.String()
 	doneEvents := extractSSEEventPayloads(body, "response.function_call_arguments.done")
@@ -183,7 +225,7 @@ func TestHandleResponsesStreamRequiredToolChoiceFailure(t *testing.T) {
 		Mode:    promptcompat.ToolChoiceRequired,
 		Allowed: map[string]struct{}{"read_file": {}},
 	}
-	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", false, false, []string{"read_file"}, policy, "")
+	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", 0, false, false, []string{"read_file"}, nil, policy, "")

 	body := rec.Body.String()
 	if !strings.Contains(body, "event: response.failed") {
@@ -213,7 +255,7 @@ func TestHandleResponsesStreamFailsWhenUpstreamHasOnlyThinking(t *testing.T) {
 		Body:       io.NopCloser(strings.NewReader(streamBody)),
 	}

-	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_test", "deepseek-v4-pro", "prompt", true, false, nil, promptcompat.DefaultToolChoicePolicy(), "")
+	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_test", "deepseek-v4-pro", "prompt", 0, true, false, nil, nil, promptcompat.DefaultToolChoicePolicy(), "")

 	body := rec.Body.String()
 	if !strings.Contains(body, "event: response.failed") {
@@ -251,11 +293,11 @@ func TestHandleResponsesStreamPromotesThinkingToolCallsOnFinalizeWithoutMidstrea
 		Body:       io.NopCloser(strings.NewReader(streamBody)),
 	}

-	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_test", "deepseek-v4-pro", "prompt", true, false, []string{"read_file"}, promptcompat.DefaultToolChoicePolicy(), "")
+	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_test", "deepseek-v4-pro", "prompt", 0, true, false, []string{"read_file"}, nil, promptcompat.DefaultToolChoicePolicy(), "")

 	body := rec.Body.String()
-	if !strings.Contains(body, "event: response.reasoning.delta") {
-		t.Fatalf("expected reasoning delta in stream body, got %s", body)
+	if strings.Contains(body, "event: response.reasoning.delta") {
+		t.Fatalf("did not expect leaked reasoning delta in stream body, got %s", body)
 	}
 	if !strings.Contains(body, "event: response.function_call_arguments.done") {
 		t.Fatalf("expected finalize fallback function call event, got %s", body)
@@ -288,7 +330,7 @@ func TestHandleResponsesStreamPromotesHiddenThinkingDSMLToolCallsOnFinalize(t *t
 		Mode:    promptcompat.ToolChoiceRequired,
 		Allowed: map[string]struct{}{"read_file": {}},
 	}
-	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_hidden", "deepseek-v4-pro", "prompt", false, false, []string{"read_file"}, policy, "")
+	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_hidden", "deepseek-v4-pro", "prompt", 0, false, false, []string{"read_file"}, nil, policy, "")

 	body := rec.Body.String()
 	if strings.Contains(body, "event: response.reasoning.delta") {
@@ -317,7 +359,7 @@ func TestHandleResponsesNonStreamRequiredToolChoiceViolation(t *testing.T) {
 		Allowed: map[string]struct{}{"read_file": {}},
 	}

-	h.handleResponsesNonStream(rec, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", false, false, []string{"read_file"}, policy, "")
+	h.handleResponsesNonStream(rec, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", 0, false, false, []string{"read_file"}, nil, policy, "")
 	if rec.Code != http.StatusUnprocessableEntity {
 		t.Fatalf("expected 422 for required tool_choice violation, got %d body=%s", rec.Code, rec.Body.String())
 	}
@@ -344,7 +386,7 @@ func TestHandleResponsesNonStreamRequiredToolChoiceIgnoresThinkingToolPayloadWhe
 		Allowed: map[string]struct{}{"read_file": {}},
 	}

-	h.handleResponsesNonStream(rec, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", true, false, []string{"read_file"}, policy, "")
+	h.handleResponsesNonStream(rec, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", 0, true, false, []string{"read_file"}, nil, policy, "")
 	if rec.Code != http.StatusUnprocessableEntity {
 		t.Fatalf("expected 422 for required tool_choice violation, got %d body=%s", rec.Code, rec.Body.String())
 	}
@@ -366,7 +408,7 @@ func TestHandleResponsesNonStreamReturns429WhenUpstreamOutputEmpty(t *testing.T)
 		)),
 	}

-	h.handleResponsesNonStream(rec, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", false, false, nil, promptcompat.DefaultToolChoicePolicy(), "")
+	h.handleResponsesNonStream(rec, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", 0, false, false, nil, nil, promptcompat.DefaultToolChoicePolicy(), "")
 	if rec.Code != http.StatusTooManyRequests {
 		t.Fatalf("expected 429 for empty upstream output, got %d body=%s", rec.Code, rec.Body.String())
 	}
@@ -388,7 +430,7 @@ func TestHandleResponsesNonStreamReturnsContentFilterErrorWhenUpstreamFilteredWi
 		)),
 	}

-	h.handleResponsesNonStream(rec, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", false, false, nil, promptcompat.DefaultToolChoicePolicy(), "")
+	h.handleResponsesNonStream(rec, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", 0, false, false, nil, nil, promptcompat.DefaultToolChoicePolicy(), "")
 	if rec.Code != http.StatusBadRequest {
 		t.Fatalf("expected 400 for filtered empty upstream output, got %d body=%s", rec.Code, rec.Body.String())
 	}
@@ -410,7 +452,7 @@ func TestHandleResponsesNonStreamReturns429WhenUpstreamHasOnlyThinking(t *testin
 		)),
 	}

-	h.handleResponsesNonStream(rec, resp, "owner-a", "resp_test", "deepseek-v4-pro", "prompt", true, false, nil, promptcompat.DefaultToolChoicePolicy(), "")
+	h.handleResponsesNonStream(rec, resp, "owner-a", "resp_test", "deepseek-v4-pro", "prompt", 0, true, false, nil, nil, promptcompat.DefaultToolChoicePolicy(), "")
 	if rec.Code != http.StatusTooManyRequests {
 		t.Fatalf("expected 429 for thinking-only upstream output, got %d body=%s", rec.Code, rec.Body.String())
 	}
@@ -432,7 +474,7 @@ func TestHandleResponsesNonStreamPromotesThinkingToolCallsWhenTextEmpty(t *testi
 		)),
 	}

-	h.handleResponsesNonStream(rec, resp, "owner-a", "resp_test", "deepseek-v4-pro", "prompt", true, false, []string{"read_file"}, promptcompat.DefaultToolChoicePolicy(), "")
+	h.handleResponsesNonStream(rec, resp, "owner-a", "resp_test", "deepseek-v4-pro", "prompt", 0, true, false, []string{"read_file"}, nil, promptcompat.DefaultToolChoicePolicy(), "")
 	if rec.Code != http.StatusOK {
 		t.Fatalf("expected 200 for thinking tool calls, got %d body=%s", rec.Code, rec.Body.String())
 	}
@@ -462,7 +504,7 @@ func TestHandleResponsesNonStreamPromotesHiddenThinkingDSMLToolCallsWhenTextEmpt
 		Mode:    promptcompat.ToolChoiceRequired,
 		Allowed: map[string]struct{}{"read_file": {}},
 	}
-	h.handleResponsesNonStream(rec, resp, "owner-a", "resp_hidden", "deepseek-v4-pro", "prompt", false, false, []string{"read_file"}, policy, "")
+	h.handleResponsesNonStream(rec, resp, "owner-a", "resp_hidden", "deepseek-v4-pro", "prompt", 0, false, false, []string{"read_file"}, nil, policy, "")
 	if rec.Code != http.StatusOK {
 		t.Fatalf("expected 200 for hidden thinking tool calls, got %d body=%s", rec.Code, rec.Body.String())
 	}
@@ -480,6 +522,53 @@ func TestHandleResponsesNonStreamPromotesHiddenThinkingDSMLToolCallsWhenTextEmpt
 	}
 }

+func TestHandleResponsesStreamCoercesSchemaDeclaredStringArguments(t *testing.T) {
+	h := &Handler{}
+	req := httptest.NewRequest(http.MethodPost, "/v1/responses", nil)
+	rec := httptest.NewRecorder()
+	toolsRaw := []any{
+		map[string]any{
+			"type": "function",
+			"function": map[string]any{
+				"name": "Write",
+				"parameters": map[string]any{
+					"type": "object",
+					"properties": map[string]any{
+						"content": map[string]any{"type": "string"},
+						"taskId":  map[string]any{"type": "string"},
+					},
+				},
+			},
+		},
+	}
+	sseLine := func(v string) string {
+		b, _ := json.Marshal(map[string]any{"p": "response/content", "v": v})
+		return "data: " + string(b) + "\n"
+	}
+	streamBody := sseLine(`<tool_calls><invoke name="Write">{"input":{"content":{"message":"hi"},"taskId":1}}</invoke></tool_calls>`) + "data: [DONE]\n"
+	resp := &http.Response{
+		StatusCode: http.StatusOK,
+		Body:       io.NopCloser(strings.NewReader(streamBody)),
+	}
+
+	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_string_protect", "deepseek-v4-flash", "prompt", 0, false, false, []string{"Write"}, toolsRaw, promptcompat.DefaultToolChoicePolicy(), "")
+
+	payload, ok := extractSSEEventPayload(rec.Body.String(), "response.function_call_arguments.done")
+	if !ok {
+		t.Fatalf("expected response.function_call_arguments.done payload, body=%s", rec.Body.String())
+	}
+	args := map[string]any{}
+	if err := json.Unmarshal([]byte(asString(payload["arguments"])), &args); err != nil {
+		t.Fatalf("decode streamed response arguments failed: %v", err)
+	}
+	if args["content"] != `{"message":"hi"}` {
+		t.Fatalf("expected response content stringified by schema, got %#v", args["content"])
+	}
+	if args["taskId"] != "1" {
+		t.Fatalf("expected response taskId stringified by schema, got %#v", args["taskId"])
+	}
+}
+
 func extractSSEEventPayload(body, targetEvent string) (map[string]any, bool) {
 	scanner := bufio.NewScanner(strings.NewReader(body))
 	matched := false
--- a/internal/httpapi/openai/shared/assistant_toolcalls.go
+++ b/internal/httpapi/openai/shared/assistant_toolcalls.go
@@ -6,12 +6,12 @@ import (
 	"ds2api/internal/toolcall"
 )

-func DetectAssistantToolCalls(text, exposedThinking, detectionThinking string, toolNames []string) toolcall.ToolCallParseResult {
-	textParsed := toolcall.ParseStandaloneToolCallsDetailed(text, toolNames)
+func DetectAssistantToolCalls(rawText, visibleText, exposedThinking, detectionThinking string, toolNames []string) toolcall.ToolCallParseResult {
+	textParsed := toolcall.ParseStandaloneToolCallsDetailed(rawText, toolNames)
 	if len(textParsed.Calls) > 0 {
 		return textParsed
 	}
-	if strings.TrimSpace(text) != "" {
+	if strings.TrimSpace(visibleText) != "" {
 		return textParsed
 	}
 	thinking := detectionThinking
--- a/internal/httpapi/openai/shared/citation_links.go
+++ b/internal/httpapi/openai/shared/citation_links.go
@@ -7,25 +7,43 @@ import (
 	"strings"
 )

-var citationMarkerPattern = regexp.MustCompile(`(?i)\[citation:\s*(\d+)\]`)
+var citationMarkerPattern = regexp.MustCompile(`(?i)\[(citation|reference):\s*(\d+)\]`)

 func ReplaceCitationMarkersWithLinks(text string, links map[int]string) string {
 	if strings.TrimSpace(text) == "" || len(links) == 0 {
 		return text
 	}
+	zeroBasedReference := hasZeroBasedReferenceMarker(text)
 	return citationMarkerPattern.ReplaceAllStringFunc(text, func(match string) string {
 		sub := citationMarkerPattern.FindStringSubmatch(match)
-		if len(sub) < 2 {
+		if len(sub) < 3 {
 			return match
 		}
-		idx, err := strconv.Atoi(strings.TrimSpace(sub[1]))
-		if err != nil || idx <= 0 {
+		idx, err := strconv.Atoi(strings.TrimSpace(sub[2]))
+		if err != nil || idx < 0 {
 			return match
 		}
-		url := strings.TrimSpace(links[idx])
+		lookupIdx := idx
+		if strings.EqualFold(sub[1], "reference") && zeroBasedReference {
+			lookupIdx = idx + 1
+		}
+		url := strings.TrimSpace(links[lookupIdx])
 		if url == "" {
 			return match
 		}
 		return fmt.Sprintf("[%d](%s)", idx, url)
 	})
 }
+
+func hasZeroBasedReferenceMarker(text string) bool {
+	for _, sub := range citationMarkerPattern.FindAllStringSubmatch(text, -1) {
+		if len(sub) < 3 || !strings.EqualFold(sub[1], "reference") {
+			continue
+		}
+		idx, err := strconv.Atoi(strings.TrimSpace(sub[2]))
+		if err == nil && idx == 0 {
+			return true
+		}
+	}
+	return false
+}
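
Reviewer note on `citation_links.go`: the index handling above is easy to misread, so here is a minimal standalone sketch of the same lookup rule. It is not the project's code: `markerPattern`, `hasZeroBasedReference`, `replaceMarkers`, and the example URLs are hypothetical names used only for illustration. The rule it mirrors is that `[citation:N]` always looks up `links[N]`, while `[reference:N]` looks up `links[N+1]` only when the same text contains a `[reference:0]` marker. The visible label keeps the original number in both cases.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// markerPattern mirrors the pattern in the diff: group 1 is the marker kind,
// group 2 is the numeric index.
var markerPattern = regexp.MustCompile(`(?i)\[(citation|reference):\s*(\d+)\]`)

// hasZeroBasedReference reports whether any [reference:0] marker is present,
// which is treated as evidence that reference markers are zero-based.
func hasZeroBasedReference(text string) bool {
	for _, sub := range markerPattern.FindAllStringSubmatch(text, -1) {
		if !strings.EqualFold(sub[1], "reference") {
			continue
		}
		if idx, err := strconv.Atoi(sub[2]); err == nil && idx == 0 {
			return true
		}
	}
	return false
}

// replaceMarkers rewrites markers into Markdown links. Markers without a
// matching link are left untouched.
func replaceMarkers(text string, links map[int]string) string {
	zeroBased := hasZeroBasedReference(text)
	return markerPattern.ReplaceAllStringFunc(text, func(match string) string {
		sub := markerPattern.FindStringSubmatch(match)
		idx, err := strconv.Atoi(sub[2])
		if err != nil {
			return match
		}
		lookup := idx
		if zeroBased && strings.EqualFold(sub[1], "reference") {
			lookup = idx + 1 // shift zero-based references onto one-based links
		}
		url := strings.TrimSpace(links[lookup])
		if url == "" {
			return match
		}
		return fmt.Sprintf("[%d](%s)", idx, url)
	})
}

func main() {
	links := map[int]string{1: "https://a.example", 2: "https://b.example"}
	fmt.Println(replaceMarkers("x [citation:1] y [reference:0] z [reference:1]", links))
	fmt.Println(replaceMarkers("only [reference:1] and [reference:2]", links))
}
```

In the first call the `[reference:0]` marker switches reference lookups to zero-based, so `[reference:1]` resolves to the second link. In the second call no zero marker exists, so reference indices are used as-is.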
--- a/internal/httpapi/openai/shared/handler_toolcall_format.go
+++ b/internal/httpapi/openai/shared/handler_toolcall_format.go
@@ -70,12 +70,13 @@ func FilterIncrementalToolCallDeltasByAllowed(deltas []toolstream.ToolCallDelta,
 	return out
 }

-func FormatFinalStreamToolCallsWithStableIDs(calls []toolcall.ParsedToolCall, ids map[int]string) []map[string]any {
+func FormatFinalStreamToolCallsWithStableIDs(calls []toolcall.ParsedToolCall, ids map[int]string, toolsRaw any) []map[string]any {
 	if len(calls) == 0 {
 		return nil
 	}
+	normalizedCalls := toolcall.NormalizeParsedToolCallsForSchemas(calls, toolsRaw)
 	out := make([]map[string]any, 0, len(calls))
-	for i, c := range calls {
+	for i, c := range normalizedCalls {
 		callID := ""
 		if ids != nil {
 			callID = strings.TrimSpace(ids[i])
--- a/internal/httpapi/openai/shared/leaked_output_sanitize.go
+++ b/internal/httpapi/openai/shared/leaked_output_sanitize.go
@@ -3,6 +3,8 @@ package shared
 import (
 	"regexp"
 	"strings"
+
+	"ds2api/internal/toolcall"
 )

 var emptyJSONFencePattern = regexp.MustCompile("(?is)```json\\s*```")
@@ -47,10 +49,42 @@ func sanitizeLeakedOutput(text string) string {
 	out = leakedThinkTagPattern.ReplaceAllString(out, "")
 	out = leakedBOSMarkerPattern.ReplaceAllString(out, "")
 	out = leakedMetaMarkerPattern.ReplaceAllString(out, "")
+	out = stripLeakedToolCallWrapperBlocks(out)
 	out = sanitizeLeakedAgentXMLBlocks(out)
 	return out
 }

+func stripLeakedToolCallWrapperBlocks(text string) string {
+	if text == "" {
+		return text
+	}
+	var b strings.Builder
+	pos := 0
+	for pos < len(text) {
+		tag, ok := toolcall.FindToolMarkupTagOutsideIgnored(text, pos)
+		if !ok {
+			b.WriteString(text[pos:])
+			break
+		}
+		if tag.Start > pos {
+			b.WriteString(text[pos:tag.Start])
+		}
+		if tag.Closing || tag.Name != "tool_calls" {
+			b.WriteString(text[tag.Start : tag.End+1])
+			pos = tag.End + 1
+			continue
+		}
+		closeTag, ok := toolcall.FindMatchingToolMarkupClose(text, tag)
+		if !ok {
+			b.WriteString(text[tag.Start : tag.End+1])
+			pos = tag.End + 1
+			continue
+		}
+		pos = closeTag.End + 1
+	}
+	return b.String()
+}
+
 func stripDanglingThinkSuffix(text string) string {
 	matches := leakedThinkTagPattern.FindAllStringIndex(text, -1)
 	if len(matches) == 0 {
--- a/internal/httpapi/requestbody/json_utf8.go
+++ b/internal/httpapi/requestbody/json_utf8.go
@@ -0,0 +1,134 @@
+package requestbody
+
+import (
+	"bytes"
+	"errors"
+	"io"
+	"mime"
+	"net/http"
+	"strings"
+	"unicode/utf8"
+)
+
+var (
+	ErrInvalidUTF8Body     = errors.New("invalid utf-8 request body")
+	errRequestBodyTooLarge = errors.New("request body too large")
+)
+
+const maxJSONUTF8ValidationSize = 100 << 20 // 100 MiB cap on the buffered request body
+
+// ValidateJSONUTF8 buffers each JSON request body and checks that all of it, trailing bytes included, is valid UTF-8.
+// On failure it swaps in a body whose Read returns the error, so downstream decoders fail instead of silently replacing bytes.
+func ValidateJSONUTF8(next http.Handler) http.Handler {
+	if next == nil {
+		return http.HandlerFunc(func(http.ResponseWriter, *http.Request) {})
+	}
+	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		if shouldValidateJSONBody(r) {
+			r.Body = validateAndReplayBody(r.Body)
+		}
+		next.ServeHTTP(w, r)
+	})
+}
+
+func shouldValidateJSONBody(r *http.Request) bool {
+	if r == nil || r.Body == nil {
+		return false
+	}
+	path := ""
+	if r.URL != nil {
+		path = r.URL.Path
+	}
+	return isJSONContentType(r.Header.Get("Content-Type")) || isKnownJSONRequestPath(r.Method, path)
+}
+
+func isJSONContentType(raw string) bool {
+	raw = strings.TrimSpace(raw)
+	if raw == "" {
+		return false
+	}
+	mediaType, _, err := mime.ParseMediaType(raw)
+	if err != nil {
+		mediaType = raw
+	}
+	mediaType = strings.ToLower(strings.TrimSpace(mediaType))
+	return strings.Contains(mediaType, "json")
+}
+
+func isKnownJSONRequestPath(method, path string) bool {
+	switch strings.ToUpper(strings.TrimSpace(method)) {
+	case http.MethodPost, http.MethodPut, http.MethodPatch, http.MethodDelete:
+	default:
+		return false
+	}
+	path = strings.TrimSpace(path)
+	if path == "" {
+		return false
+	}
+	switch {
+	case path == "/v1/chat/completions" || path == "/chat/completions":
+		return true
+	case path == "/v1/responses" || path == "/responses":
+		return true
+	case path == "/v1/embeddings" || path == "/embeddings":
+		return true
+	case path == "/anthropic/v1/messages" || path == "/v1/messages" || path == "/messages":
+		return true
+	case path == "/anthropic/v1/messages/count_tokens" || path == "/v1/messages/count_tokens" || path == "/messages/count_tokens":
+		return true
+	case strings.HasPrefix(path, "/v1beta/models/") || strings.HasPrefix(path, "/v1/models/"):
+		return strings.Contains(path, ":generateContent") || strings.Contains(path, ":streamGenerateContent")
+	case strings.HasPrefix(path, "/admin/"):
+		return true
+	default:
+		return false
+	}
+}
+
+func validateAndReplayBody(body io.ReadCloser) io.ReadCloser {
+	if body == nil {
+		return body
+	}
+	raw, err := io.ReadAll(io.LimitReader(body, maxJSONUTF8ValidationSize+1))
+	if err != nil {
+		return &errorReadCloser{err: err, closer: body}
+	}
+	if len(raw) > maxJSONUTF8ValidationSize {
+		return &errorReadCloser{err: errRequestBodyTooLarge, closer: body}
+	}
+	if !utf8.Valid(raw) {
+		return &errorReadCloser{err: ErrInvalidUTF8Body, closer: body}
+	}
+	return &replayReadCloser{Reader: bytes.NewReader(raw), closer: body}
+}
+
+type replayReadCloser struct {
+	*bytes.Reader
+	closer io.Closer
+}
+
+func (r *replayReadCloser) Close() error {
+	if r == nil || r.closer == nil {
+		return nil
+	}
+	return r.closer.Close()
+}
+
+type errorReadCloser struct {
+	err    error
+	closer io.Closer
+}
+
+func (r *errorReadCloser) Read([]byte) (int, error) {
+	if r == nil || r.err == nil {
+		return 0, io.EOF
+	}
+	return 0, r.err
+}
+
+func (r *errorReadCloser) Close() error {
+	if r == nil || r.closer == nil {
+		return nil
+	}
+	return r.closer.Close()
+}
--- a/internal/httpapi/requestbody/json_utf8_test.go
+++ b/internal/httpapi/requestbody/json_utf8_test.go
@@ -0,0 +1,158 @@
+package requestbody
+
+import (
+	"bytes"
+	"encoding/json"
+	"io"
+	"net/http"
+	"net/http/httptest"
+	"strings"
+	"testing"
+)
+
+type singleByteReadCloser struct {
+	data []byte
+	pos  int
+}
+
+func (r *singleByteReadCloser) Read(p []byte) (int, error) {
+	if r.pos >= len(r.data) {
+		return 0, io.EOF
+	}
+	p[0] = r.data[r.pos]
+	r.pos++
+	return 1, nil
+}
+
+func (r *singleByteReadCloser) Close() error {
+	return nil
+}
+
+func TestValidateJSONUTF8AllowsSplitMultibyteRunes(t *testing.T) {
+	body := []byte(`{"text":"你好"}`)
+	handler := ValidateJSONUTF8(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		var req map[string]any
+		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
+			t.Fatalf("unexpected decode error: %v", err)
+		}
+		w.WriteHeader(http.StatusNoContent)
+	}))
+
+	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", &singleByteReadCloser{data: body})
+	req.Header.Set("Content-Type", "application/json")
+	rec := httptest.NewRecorder()
+
+	handler.ServeHTTP(rec, req)
+
+	if rec.Code != http.StatusNoContent {
+		t.Fatalf("expected 204 for valid utf-8 json, got %d body=%q", rec.Code, rec.Body.String())
+	}
+}
+
+func TestValidateJSONUTF8RejectsInvalidBytesBeforeJSONDecode(t *testing.T) {
+	body := append([]byte(`{"text":"`), 0xff)
+	body = append(body, []byte(`"}`)...)
+	handler := ValidateJSONUTF8(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		var req map[string]any
+		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
+			w.WriteHeader(http.StatusBadRequest)
+			_, _ = w.Write([]byte(err.Error()))
+			return
+		}
+		w.WriteHeader(http.StatusOK)
+	}))
+
+	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", bytes.NewReader(body))
+	req.Header.Set("Content-Type", "application/json; charset=utf-8")
+	rec := httptest.NewRecorder()
+
+	handler.ServeHTTP(rec, req)
+
+	if rec.Code != http.StatusBadRequest {
+		t.Fatalf("expected 400 for invalid utf-8 json, got %d body=%q", rec.Code, rec.Body.String())
+	}
+	if !strings.Contains(strings.ToLower(rec.Body.String()), "invalid utf-8") {
+		t.Fatalf("expected utf-8 validation error, got %q", rec.Body.String())
+	}
+}
+
+func TestValidateJSONUTF8RejectsInvalidBytesWithoutJSONContentTypeOnKnownPath(t *testing.T) {
+	body := append([]byte(`{"text":"`), 0xff)
+	body = append(body, []byte(`"}`)...)
+	handler := ValidateJSONUTF8(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		var req map[string]any
+		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
+			w.WriteHeader(http.StatusBadRequest)
+			_, _ = w.Write([]byte(err.Error()))
+			return
+		}
+		w.WriteHeader(http.StatusOK)
+	}))
+
+	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", bytes.NewReader(body))
+	req.Header.Set("Content-Type", "text/plain")
+	rec := httptest.NewRecorder()
+
+	handler.ServeHTTP(rec, req)
+
+	if rec.Code != http.StatusBadRequest {
+		t.Fatalf("expected 400 for invalid utf-8 json, got %d body=%q", rec.Code, rec.Body.String())
+	}
+	if !strings.Contains(strings.ToLower(rec.Body.String()), "invalid utf-8") {
+		t.Fatalf("expected utf-8 validation error, got %q", rec.Body.String())
+	}
+}
+
+func TestValidateJSONUTF8RejectsTrailingInvalidBytesAfterJSONValue(t *testing.T) {
+	body := append([]byte(`{"text":"ok"}`), 0xff)
+	handler := ValidateJSONUTF8(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		var req map[string]any
+		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
+			w.WriteHeader(http.StatusBadRequest)
+			_, _ = w.Write([]byte(err.Error()))
+			return
+		}
+		w.WriteHeader(http.StatusOK)
+	}))
+
+	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", bytes.NewReader(body))
+	req.Header.Set("Content-Type", "application/json")
+	rec := httptest.NewRecorder()
+
+	handler.ServeHTTP(rec, req)
+
+	if rec.Code != http.StatusBadRequest {
+		t.Fatalf("expected 400 for trailing invalid utf-8, got %d body=%q", rec.Code, rec.Body.String())
+	}
+	if !strings.Contains(strings.ToLower(rec.Body.String()), "invalid utf-8") {
+		t.Fatalf("expected utf-8 validation error, got %q", rec.Body.String())
+	}
+}
+
+func TestIsJSONContentType(t *testing.T) {
+	for _, raw := range []string{
+		"application/json",
+		"application/json; charset=utf-8",
+		"application/problem+json",
+		"application/vnd.api+json",
+	} {
+		if !isJSONContentType(raw) {
+			t.Fatalf("expected %q to be recognized as json", raw)
+		}
+	}
+	for _, raw := range []string{
+		"multipart/form-data; boundary=abc",
+		"text/plain",
+		"application/octet-stream",
+	} {
+		if isJSONContentType(raw) {
+			t.Fatalf("expected %q not to be recognized as json", raw)
+		}
+	}
+}
+
+func TestIsKnownJSONRequestPathIncludesGeminiStream(t *testing.T) {
+	if !isKnownJSONRequestPath(http.MethodPost, "/v1beta/models/gemini-pro:streamGenerateContent") {
+		t.Fatal("expected Gemini stream generate path to be recognized as json")
+	}
+}
--- a/internal/js/chat-stream/sse_parse_impl.js
+++ b/internal/js/chat-stream/sse_parse_impl.js
@@ -70,7 +70,6 @@ function finalizeThinkingParts(parts, thinkingEnabled, newType) {
  }
  if (!thinkingEnabled) {
    finalParts = dropThinkingParts(finalParts);
-    finalType = 'text';
  }
  return { parts: finalParts, newType: finalType };
 }
@@ -213,6 +212,12 @@ function parseChunkForContent(chunk, thinkingEnabled, currentType, stripReferenc
    }
  }

+  if (pathValue === 'response/content') {
+    newType = 'text';
+  } else if (pathValue === 'response/thinking_content' && (!thinkingEnabled || newType !== 'text')) {
+    newType = 'thinking';
+  }
+
  let partType = 'text';
  if (pathValue === 'response/thinking_content') {
    if (!thinkingEnabled) {
@@ -226,8 +231,8 @@ function parseChunkForContent(chunk, thinkingEnabled, currentType, stripReferenc
    partType = 'text';
  } else if (pathValue.includes('response/fragments') && pathValue.includes('/content')) {
    partType = newType;
-  } else if (!pathValue && thinkingEnabled) {
-    partType = newType;
+  } else if (!pathValue) {
+    partType = newType || 'text';
  }

  const val = chunk.v;
@@ -308,6 +313,10 @@ function parseChunkForContent(chunk, thinkingEnabled, currentType, stripReferenc
  }

  if (val && typeof val === 'object') {
+    const directContent = asContentString(val, stripReferenceMarkers);
+    if (directContent) {
+      parts.push({ text: directContent, type: partType });
+    }
    const resp = val.response && typeof val.response === 'object' ? val.response : val;
    if (Array.isArray(resp.fragments)) {
      for (const frag of resp.fragments) {
@@ -593,6 +602,12 @@ function asContentString(v, stripReferenceMarkers = true) {
    if (Object.prototype.hasOwnProperty.call(v, 'v')) {
      return asContentString(v.v, stripReferenceMarkers);
    }
+    if (Object.prototype.hasOwnProperty.call(v, 'text')) {
+      return asContentString(v.text, stripReferenceMarkers);
+    }
+    if (Object.prototype.hasOwnProperty.call(v, 'value')) {
+      return asContentString(v.value, stripReferenceMarkers);
+    }
    return '';
  }
  if (v == null) {
--- a/internal/js/chat-stream/stream_emitter.js
+++ b/internal/js/chat-stream/stream_emitter.js
@@ -1,5 +1,8 @@
 'use strict';

+const MIN_DELTA_FLUSH_CHARS = 160; // flush once this many code points are buffered
+const MAX_DELTA_FLUSH_WAIT_MS = 80; // otherwise flush after this many milliseconds
+
 function createChatCompletionEmitter({ res, sessionID, created, model, isClosed }) {
  let firstChunkSent = false;

@@ -34,6 +37,62 @@ function createChatCompletionEmitter({ res, sessionID, created, model, isClosed
  };
 }

+function createDeltaCoalescer({ sendDeltaFrame, minFlushChars = MIN_DELTA_FLUSH_CHARS, maxFlushWaitMS = MAX_DELTA_FLUSH_WAIT_MS }) {
+  let pendingField = '';
+  let pendingText = '';
+  let flushTimer = null;
+
+  const clearFlushTimer = () => {
+    if (flushTimer) {
+      clearTimeout(flushTimer);
+      flushTimer = null;
+    }
+  };
+
+  const flush = () => {
+    clearFlushTimer();
+    if (!pendingField || !pendingText) {
+      return;
+    }
+    const delta = { [pendingField]: pendingText };
+    pendingField = '';
+    pendingText = '';
+    sendDeltaFrame(delta);
+  };
+
+  const scheduleFlush = () => {
+    if (flushTimer || maxFlushWaitMS <= 0) {
+      return;
+    }
+    flushTimer = setTimeout(flush, maxFlushWaitMS);
+    if (typeof flushTimer.unref === 'function') {
+      flushTimer.unref();
+    }
+  };
+
+  const append = (field, text) => {
+    if (!field || !text) {
+      return;
+    }
+    if (pendingField && pendingField !== field) {
+      flush();
+    }
+    pendingField = field;
+    pendingText += text;
+    if ([...pendingText].length >= minFlushChars) {
+      flush();
+      return;
+    }
+    scheduleFlush();
+  };
+
+  return {
+    append,
+    flush,
+  };
+}
+
 module.exports = {
  createChatCompletionEmitter,
+  createDeltaCoalescer,
 };
--- a/internal/js/chat-stream/vercel_stream_impl.js
+++ b/internal/js/chat-stream/vercel_stream_impl.js
@@ -20,7 +20,7 @@ const {
  boolDefaultTrue,
  resetStreamToolCallState,
 } = require('./toolcall_policy');
-const { createChatCompletionEmitter } = require('./stream_emitter');
+const { createChatCompletionEmitter, createDeltaCoalescer } = require('./stream_emitter');
 const {
  asString,
  isAbortError,
@@ -191,6 +191,7 @@ async function handleVercelStream(req, res, rawBody, payload) {
      model,
      isClosed: () => clientClosed,
    });
+    const deltaCoalescer = createDeltaCoalescer({ sendDeltaFrame });

    const finish = async (reason, options = {}) => {
      if (ended) {
@@ -201,25 +202,28 @@ async function handleVercelStream(req, res, rawBody, payload) {
        await releaseLease();
        return true;
      }
+      deltaCoalescer.flush();
      const detected = parseStandaloneToolCalls(outputText, toolNames);
      if (detected.length > 0 && !toolCallsDoneEmitted) {
        toolCallsEmitted = true;
        toolCallsDoneEmitted = true;
-        sendDeltaFrame({ tool_calls: formatOpenAIStreamToolCalls(detected, streamToolCallIDs) });
+        sendDeltaFrame({ tool_calls: formatOpenAIStreamToolCalls(detected, streamToolCallIDs, payload.tools) });
      } else if (toolSieveEnabled) {
        const tailEvents = flushToolSieve(toolSieveState, toolNames);
        for (const evt of tailEvents) {
          if (evt.type === 'tool_calls' && Array.isArray(evt.calls) && evt.calls.length > 0) {
+            deltaCoalescer.flush();
            toolCallsEmitted = true;
            toolCallsDoneEmitted = true;
-            sendDeltaFrame({ tool_calls: formatOpenAIStreamToolCalls(evt.calls, streamToolCallIDs) });
+            sendDeltaFrame({ tool_calls: formatOpenAIStreamToolCalls(evt.calls, streamToolCallIDs, payload.tools) });
            resetStreamToolCallState(streamToolCallIDs, streamToolNames);
            continue;
          }
          if (evt.text) {
-            sendDeltaFrame({ content: evt.text });
+            deltaCoalescer.append('content', evt.text);
          }
        }
+        deltaCoalescer.flush();
      }
      if (detected.length > 0 || toolCallsEmitted) {
        reason = 'tool_calls';
@@ -327,7 +331,7 @@ async function handleVercelStream(req, res, rawBody, payload) {
                      continue;
                    }
                    thinkingText += trimmed;
-                    sendDeltaFrame({ reasoning_content: trimmed });
+                    deltaCoalescer.append('reasoning_content', trimmed);
                  }
                } else {
                  const trimmed = trimContinuationOverlap(outputText, p.text);
@@ -339,7 +343,7 @@ async function handleVercelStream(req, res, rawBody, payload) {
                  }
                  outputText += trimmed;
                  if (!toolSieveEnabled) {
-                    sendDeltaFrame({ content: trimmed });
+                    deltaCoalescer.append('content', trimmed);
                    continue;
                  }
                  const events = processToolSieveChunk(toolSieveState, trimmed, toolNames);
@@ -352,6 +356,7 @@ async function handleVercelStream(req, res, rawBody, payload) {
                      const formatted = formatIncrementalToolCallDeltas(filtered, streamToolCallIDs);
                      if (formatted.length > 0) {
                        toolCallsEmitted = true;
+                        deltaCoalescer.flush();
                        sendDeltaFrame({ tool_calls: formatted });
                      }
                      continue;
@@ -359,12 +364,13 @@ async function handleVercelStream(req, res, rawBody, payload) {
                    if (evt.type === 'tool_calls') {
                      toolCallsEmitted = true;
                      toolCallsDoneEmitted = true;
-                      sendDeltaFrame({ tool_calls: formatOpenAIStreamToolCalls(evt.calls, streamToolCallIDs) });
+                      deltaCoalescer.flush();
+                      sendDeltaFrame({ tool_calls: formatOpenAIStreamToolCalls(evt.calls, streamToolCallIDs, payload.tools) });
                      resetStreamToolCallState(streamToolCallIDs, streamToolNames);
                      continue;
                    }
                    if (evt.text) {
-                      sendDeltaFrame({ content: evt.text });
+                      deltaCoalescer.append('content', evt.text);
                    }
                  }
                }
@@ -510,27 +516,87 @@ function observeContinueState(state, chunk) {
  if (topID > 0) {
    state.responseMessageID = topID;
  }
-  if (chunk.p === 'response/status') {
-    setContinueStatus(state, asString(chunk.v));
+  observeContinueDirectPatch(state, chunk.p, chunk.v);
+  if (chunk.p === 'response') {
+    observeContinueBatchPatches(state, 'response', chunk.v);
+  } else {
+    observeContinueBatchPatches(state, '', chunk.v);
  }
  const response = chunk.v && typeof chunk.v === 'object' ? chunk.v.response : null;
-  if (response && typeof response === 'object') {
-    const id = numberValue(response.message_id);
-    if (id > 0) {
-      state.responseMessageID = id;
-    }
-    setContinueStatus(state, asString(response.status));
-    if (response.auto_continue === true) {
-      state.lastStatus = 'AUTO_CONTINUE';
-    }
-  }
+  observeContinueResponseObject(state, response);
  const messageResponse = chunk.message && typeof chunk.message === 'object' && chunk.message.response;
-  if (messageResponse && typeof messageResponse === 'object') {
-    const id = numberValue(messageResponse.message_id);
-    if (id > 0) {
-      state.responseMessageID = id;
+  observeContinueResponseObject(state, messageResponse);
+}
+
+function observeContinueDirectPatch(state, path, value) {
+  if (!state) {
+    return;
+  }
+  switch (asString(path).trim().replace(/^\/+|\/+$/g, '')) {
+    case 'response/status':
+    case 'status':
+    case 'response/quasi_status':
+    case 'quasi_status':
+      setContinueStatus(state, asString(value));
+      break;
+    case 'response/auto_continue':
+    case 'auto_continue':
+      if (value === true) {
+        state.lastStatus = 'AUTO_CONTINUE';
+      }
+      break;
+    default:
+      break;
+  }
+}
+
+function observeContinueResponseObject(state, response) {
+  if (!state || !response || typeof response !== 'object') {
+    return;
+  }
+  const id = numberValue(response.message_id);
+  if (id > 0) {
+    state.responseMessageID = id;
+  }
+  setContinueStatus(state, asString(response.status));
+  if (response.auto_continue === true) {
+    state.lastStatus = 'AUTO_CONTINUE';
+  }
+}
+
+function observeContinueBatchPatches(state, parentPath, raw) {
+  if (!state || !Array.isArray(raw)) {
+    return;
+  }
+  for (const patch of raw) {
+    if (!patch || typeof patch !== 'object') {
+      continue;
+    }
+    const path = asString(patch.p).trim();
+    if (!path) {
+      continue;
+    }
+    let fullPath = path;
+    const parent = asString(parentPath).trim().replace(/^\/+|\/+$/g, '');
+    if (parent && !path.includes('/')) {
+      fullPath = `${parent}/${path}`;
+    }
+    switch (fullPath.replace(/^\/+|\/+$/g, '')) {
+      case 'response/status':
+      case 'status':
+      case 'response/quasi_status':
+      case 'quasi_status':
+        setContinueStatus(state, asString(patch.v));
+        break;
+      case 'response/auto_continue':
+      case 'auto_continue':
+        if (patch.v === true) {
+          state.lastStatus = 'AUTO_CONTINUE';
+        }
+        break;
+      default:
+        break;
    }
-    setContinueStatus(state, asString(messageResponse.status));
  }
 }

@@ -540,7 +606,7 @@ function setContinueStatus(state, status) {
    return;
  }
  state.lastStatus = normalized;
-  if (normalized.toUpperCase() === 'FINISHED') {
+  if (['FINISHED', 'CONTENT_FILTER'].includes(normalized.toUpperCase())) {
    state.finished = true;
  }
 }
@@ -549,7 +615,7 @@ function shouldAutoContinue(state) {
  if (!state || state.finished || !state.sessionID || state.responseMessageID <= 0) {
    return false;
  }
-  return ['WIP', 'INCOMPLETE', 'AUTO_CONTINUE'].includes(asString(state.lastStatus).trim().toUpperCase());
+  return ['INCOMPLETE', 'AUTO_CONTINUE'].includes(asString(state.lastStatus).trim().toUpperCase());
 }

 function numberValue(v) {
--- a/internal/js/helpers/stream-tool-sieve/format.js
+++ b/internal/js/helpers/stream-tool-sieve/format.js
@@ -2,11 +2,12 @@

 const crypto = require('crypto');

-function formatOpenAIStreamToolCalls(calls, idStore) {
+function formatOpenAIStreamToolCalls(calls, idStore, toolsRaw) {
  if (!Array.isArray(calls) || calls.length === 0) {
    return [];
  }
-  return calls.map((c, idx) => ({
+  const normalized = normalizeParsedToolCallsForSchemas(calls, toolsRaw);
+  return normalized.map((c, idx) => ({
    index: idx,
    id: ensureStreamToolCallID(idStore, idx),
    type: 'function',
@@ -17,6 +18,194 @@ function formatOpenAIStreamToolCalls(calls, idStore) {
  }));
 }

+function normalizeParsedToolCallsForSchemas(calls, toolsRaw) {
+  if (!Array.isArray(calls) || calls.length === 0) {
+    return calls;
+  }
+  const schemas = buildToolSchemaIndex(toolsRaw);
+  if (!schemas) {
+    return calls;
+  }
+  let changedAny = false;
+  const out = calls.map((call) => {
+    const name = String(call && call.name || '').trim().toLowerCase();
+    const schema = schemas[name];
+    if (!schema || !call || !call.input || typeof call.input !== 'object' || Array.isArray(call.input)) {
+      return call;
+    }
+    const [normalized, changed] = normalizeToolValueWithSchema(call.input, schema);
+    if (!changed || !normalized || typeof normalized !== 'object' || Array.isArray(normalized)) {
+      return call;
+    }
+    changedAny = true;
+    return { ...call, input: normalized };
+  });
+  return changedAny ? out : calls;
+}
+
+function buildToolSchemaIndex(toolsRaw) {
+  if (!Array.isArray(toolsRaw) || toolsRaw.length === 0) {
+    return null;
+  }
+  const out = {};
+  for (const item of toolsRaw) {
+    if (!item || typeof item !== 'object' || Array.isArray(item)) {
+      continue;
+    }
+    const [name, schema] = extractToolNameAndSchema(item);
+    if (!name || !schema || typeof schema !== 'object' || Array.isArray(schema)) {
+      continue;
+    }
+    out[name.toLowerCase()] = schema;
+  }
+  return Object.keys(out).length > 0 ? out : null;
+}
+
+function extractToolNameAndSchema(tool) {
+  const fn = tool && typeof tool.function === 'object' && !Array.isArray(tool.function) ? tool.function : null;
+  const name = firstNonEmptyString(tool.name, fn && fn.name);
+  const schema = firstNonNil(
+    tool.parameters,
+    tool.input_schema,
+    tool.inputSchema,
+    tool.schema,
+    fn && fn.parameters,
+    fn && fn.input_schema,
+    fn && fn.inputSchema,
+    fn && fn.schema,
+  );
+  return [name, schema];
+}
+
+function normalizeToolValueWithSchema(value, schema) {
+  if (value == null || !schema || typeof schema !== 'object' || Array.isArray(schema)) {
+    return [value, false];
+  }
+  if (shouldCoerceSchemaToString(schema)) {
+    return stringifySchemaValue(value);
+  }
+  if (looksLikeObjectSchema(schema)) {
+    if (!value || typeof value !== 'object' || Array.isArray(value)) {
+      return [value, false];
+    }
+    const properties = schema.properties && typeof schema.properties === 'object' && !Array.isArray(schema.properties) ? schema.properties : null;
+    const additional = schema.additionalProperties;
+    let changed = false;
+    const out = {};
+    for (const [key, current] of Object.entries(value)) {
+      let next = current;
+      let fieldChanged = false;
+      if (properties && Object.prototype.hasOwnProperty.call(properties, key)) {
+        [next, fieldChanged] = normalizeToolValueWithSchema(current, properties[key]);
+      } else if (additional != null) {
+        [next, fieldChanged] = normalizeToolValueWithSchema(current, additional);
+      }
+      out[key] = next;
+      changed = changed || fieldChanged;
+    }
+    return changed ? [out, true] : [value, false];
+  }
+  if (looksLikeArraySchema(schema)) {
+    if (!Array.isArray(value) || value.length === 0 || schema.items == null) {
+      return [value, false];
+    }
+    let changed = false;
+    const out = value.map((item, idx) => {
+      const itemSchema = Array.isArray(schema.items) ? schema.items[idx] : schema.items;
+      if (itemSchema == null) {
+        return item;
+      }
+      const [next, itemChanged] = normalizeToolValueWithSchema(item, itemSchema);
+      changed = changed || itemChanged;
+      return next;
+    });
+    return changed ? [out, true] : [value, false];
+  }
+  return [value, false];
+}
+
+function shouldCoerceSchemaToString(schema) {
+  if (!schema || typeof schema !== 'object' || Array.isArray(schema)) {
+    return false;
+  }
+  if (typeof schema.const === 'string') {
+    return true;
+  }
+  if (Array.isArray(schema.enum) && schema.enum.length > 0 && schema.enum.every((item) => typeof item === 'string')) {
+    return true;
+  }
+  if (typeof schema.type === 'string') {
+    return schema.type.trim().toLowerCase() === 'string';
+  }
+  if (Array.isArray(schema.type) && schema.type.length > 0) {
+    let hasString = false;
+    for (const item of schema.type) {
+      if (typeof item !== 'string') {
+        return false;
+      }
+      const typ = item.trim().toLowerCase();
+      if (typ === 'string') {
+        hasString = true;
+      } else if (typ !== 'null') {
+        return false;
+      }
+    }
+    return hasString;
+  }
+  return false;
+}
+
+function looksLikeObjectSchema(schema) {
+  return !!schema && typeof schema === 'object' && !Array.isArray(schema) && (
+    (typeof schema.type === 'string' && schema.type.trim().toLowerCase() === 'object') ||
+    (schema.properties && typeof schema.properties === 'object' && !Array.isArray(schema.properties)) ||
+    schema.additionalProperties != null
+  );
+}
+
+function looksLikeArraySchema(schema) {
+  return !!schema && typeof schema === 'object' && !Array.isArray(schema) && (
+    (typeof schema.type === 'string' && schema.type.trim().toLowerCase() === 'array') ||
+    schema.items != null
+  );
+}
+
+function stringifySchemaValue(value) {
+  if (value == null) {
+    return [value, false];
+  }
+  if (typeof value === 'string') {
+    return [value, false];
+  }
+  try {
+    return [JSON.stringify(value), true];
+  } catch {
+    return [value, false];
+  }
+}
+
+function firstNonNil(...values) {
+  for (const value of values) {
+    if (value != null) {
+      return value;
+    }
+  }
+  return null;
+}
+
+function firstNonEmptyString(...values) {
+  for (const value of values) {
+    if (typeof value !== 'string') {
+      continue;
+    }
+    const trimmed = value.trim();
+    if (trimmed) {
+      return trimmed;
+    }
+  }
+  return '';
+}
+
 function ensureStreamToolCallID(idStore, index) {
  if (!(idStore instanceof Map)) {
    return `call_${newCallID()}`;
--- a/internal/js/helpers/stream-tool-sieve/parse_payload.js
+++ b/internal/js/helpers/stream-tool-sieve/parse_payload.js
@@ -248,6 +248,9 @@ function replaceDSMLToolMarkupOutsideIgnored(text) {
    if (tag) {
      if (tag.dsmlLike) {
        out += `<${tag.closing ? '/' : ''}${tag.name}${raw.slice(tag.nameEnd, tag.end + 1)}`;
+        if (raw[tag.end] !== '>') {
+          out += '>';
+        }
      } else {
        out += raw.slice(tag.start, tag.end + 1);
      }
@@ -424,31 +427,42 @@ function scanToolMarkupTagAt(text, start) {
  }
  const lower = raw.toLowerCase();
  let i = start + 1;
+  while (i < raw.length && raw[i] === '<') {
+    i += 1;
+  }
  const closing = raw[i] === '/';
  if (closing) {
    i += 1;
  }
-  let dsmlLike = false;
-  if (i < raw.length && isToolMarkupPipe(raw[i])) {
-    dsmlLike = true;
-    i += 1;
-  }
-  if (lower.startsWith('dsml', i)) {
-    dsmlLike = true;
-    i += 'dsml'.length;
-    while (i < raw.length && isToolMarkupSeparator(raw[i])) {
-      i += 1;
-    }
-  }
+  const prefix = consumeToolMarkupNamePrefix(raw, lower, i);
+  i = prefix.next;
+  const dsmlLike = prefix.dsmlLike;
  const { name, len } = matchToolMarkupName(lower, i);
  if (!name) {
    return null;
  }
-  const nameEnd = i + len;
+  const originalNameEnd = i + len;
+  let nameEnd = originalNameEnd;
+  while (nameEnd < raw.length && isToolMarkupPipe(raw[nameEnd])) {
+    nameEnd += 1;
+  }
+  const hasTrailingPipe = nameEnd > originalNameEnd;
  if (!hasXmlTagBoundary(raw, nameEnd)) {
    return null;
  }
-  const end = findXmlTagEnd(raw, nameEnd);
+  let end = findXmlTagEnd(raw, nameEnd);
+  if (end < 0) {
+    if (!hasTrailingPipe) {
+      return null;
+    }
+    end = nameEnd - 1;
+  }
+  if (hasTrailingPipe) {
+    const nextLT = raw.indexOf('<', nameEnd);
+    if (nextLT >= 0 && end >= nextLT) {
+      end = nameEnd - 1;
+    }
+  }
  if (end < 0) {
    return null;
  }
@@ -520,36 +534,94 @@ function findPartialToolMarkupStart(text) {
  if (lastLT < 0) {
    return -1;
  }
-  const tail = raw.slice(lastLT);
+  const start = includeDuplicateLeadingLessThan(raw, lastLT);
+  const tail = raw.slice(start);
  if (tail.includes('>')) {
    return -1;
  }
-  const lowerTail = tail.toLowerCase();
-  const candidates = [
-    '<tool_calls', '<invoke', '<parameter',
-    '<|tool_calls', '<|invoke', '<|parameter',
-    '<｜tool_calls', '<｜invoke', '<｜parameter',
-    '<|dsml|tool_calls', '<|dsml|invoke', '<|dsml|parameter',
-    '<dsmltool_calls', '<dsmlinvoke', '<dsmlparameter',
-    '<dsml tool_calls', '<dsml invoke', '<dsml parameter',
-    '<dsml|tool_calls', '<dsml|invoke', '<dsml|parameter',
-    '<|dsmltool_calls', '<|dsmlinvoke', '<|dsmlparameter',
-    '<|dsml tool_calls', '<|dsml invoke', '<|dsml parameter',
-  ];
-  for (const candidate of candidates) {
-    if (candidate.startsWith(lowerTail)) {
-      return lastLT;
-    }
+  return isPartialToolMarkupTagPrefix(tail) ? start : -1;
+}
+
+function includeDuplicateLeadingLessThan(text, idx) {
+  let out = idx;
+  while (out > 0 && text[out - 1] === '<') {
+    out -= 1;
  }
-  return -1;
+  return out;
 }

 function isToolMarkupPipe(ch) {
  return ch === '|' || ch === '｜';
 }

-function isToolMarkupSeparator(ch) {
-  return ch === ' ' || ch === '\t' || ch === '\r' || ch === '\n' || isToolMarkupPipe(ch);
+function isPartialToolMarkupTagPrefix(text) {
+  const raw = toStringSafe(text);
+  if (!raw || raw[0] !== '<' || raw.includes('>')) {
+    return false;
+  }
+  const lower = raw.toLowerCase();
+  let i = 1;
+  while (i < raw.length && raw[i] === '<') {
+    i += 1;
+  }
+  if (i >= raw.length) {
+    return true;
+  }
+  if (raw[i] === '/') {
+    i += 1;
+  }
+  while (i <= raw.length) {
+    if (i === raw.length) {
+      return true;
+    }
+    if (hasToolMarkupNamePrefix(lower.slice(i))) {
+      return true;
+    }
+    if ('dsml'.startsWith(lower.slice(i))) {
+      return true;
+    }
+    const next = consumeToolMarkupNamePrefixOnce(raw, lower, i);
+    if (!next.ok) {
+      return false;
+    }
+    i = next.next;
+  }
+  return false;
+}
+
+function consumeToolMarkupNamePrefix(raw, lower, idx) {
+  let next = idx;
+  let dsmlLike = false;
+  while (true) {
+    const consumed = consumeToolMarkupNamePrefixOnce(raw, lower, next);
+    if (!consumed.ok) {
+      return { next, dsmlLike };
+    }
+    next = consumed.next;
+    dsmlLike = true;
+  }
+}
+
+function consumeToolMarkupNamePrefixOnce(raw, lower, idx) {
+  if (idx < raw.length && isToolMarkupPipe(raw[idx])) {
+    return { next: idx + 1, ok: true };
+  }
+  if (idx < raw.length && [' ', '\t', '\r', '\n'].includes(raw[idx])) {
+    return { next: idx + 1, ok: true };
+  }
+  if (lower.startsWith('dsml', idx)) {
+    return { next: idx + 'dsml'.length, ok: true };
+  }
+  return { next: idx, ok: false };
+}
+
+function hasToolMarkupNamePrefix(lowerTail) {
+  for (const name of TOOL_MARKUP_NAMES) {
+    if (lowerTail.startsWith(name) || name.startsWith(lowerTail)) {
+      return true;
+    }
+  }
+  return false;
 }

 function matchToolMarkupName(lower, start) {
@@ -774,10 +846,18 @@ function parseMarkupValue(raw, paramName = '') {
  if (cdata.ok) {
    const literal = parseJSONLiteralValue(cdata.value);
    if (literal.ok) {
+      const literalArray = coerceArrayValue(literal.value, paramName);
+      if (literalArray.ok) {
+        return literalArray.value;
+      }
      return literal.value;
    }
    const structured = parseStructuredCDATAParameterValue(paramName, cdata.value);
-    return structured.ok ? structured.value : cdata.value;
+    if (structured.ok) {
+      return structured.value;
+    }
+    const looseArray = parseLooseJSONArrayValue(cdata.value, paramName);
+    return looseArray.ok ? looseArray.value : cdata.value;
  }
  const s = toStringSafe(extractRawTagValue(raw)).trim();
  if (!s) {
@@ -790,8 +870,14 @@ function parseMarkupValue(raw, paramName = '') {
      return nested;
    }
    if (nested && typeof nested === 'object') {
+      const nestedArray = coerceArrayValue(nested, paramName);
+      if (nestedArray.ok) {
+        return nestedArray.value;
+      }
      if (isOnlyRawValue(nested)) {
-        return toStringSafe(nested._raw);
+        const rawValue = toStringSafe(nested._raw);
+        const looseArray = parseLooseJSONArrayValue(rawValue, paramName);
+        return looseArray.ok ? looseArray.value : rawValue;
      }
      return nested;
    }
@@ -799,8 +885,16 @@ function parseMarkupValue(raw, paramName = '') {

  const literal = parseJSONLiteralValue(s);
  if (literal.ok) {
+    const literalArray = coerceArrayValue(literal.value, paramName);
+    if (literalArray.ok) {
+      return literalArray.value;
+    }
    return literal.value;
  }
+  const looseArray = parseLooseJSONArrayValue(s, paramName);
+  if (looseArray.ok) {
+    return looseArray.value;
+  }
  return s;
 }

@@ -812,6 +906,9 @@ function parseStructuredCDATAParameterValue(paramName, raw) {
  if (!normalized.includes('<') || !normalized.includes('>')) {
    return { ok: false, value: null };
  }
+  if (!cdataFragmentLooksExplicitlyStructured(normalized)) {
+    return { ok: false, value: null };
+  }
  const parsed = parseMarkupInput(normalized);
  if (Array.isArray(parsed)) {
    return { ok: true, value: parsed };
@@ -826,6 +923,21 @@ function normalizeCDATAForStructuredParse(raw) {
  return unescapeHtml(toStringSafe(raw).replace(/<br\s*\/?>/gi, '\n').trim());
 }

+function cdataFragmentLooksExplicitlyStructured(raw) {
+  const blocks = findGenericXmlElementBlocks(raw);
+  if (blocks.length === 0) {
+    return false;
+  }
+  if (blocks.length > 1) {
+    return true;
+  }
+  const block = blocks[0];
+  if (toStringSafe(block.localName).trim().toLowerCase() === 'item') {
+    return true;
+  }
+  return findGenericXmlElementBlocks(block.body).length > 0;
+}
+
 function preservesCDATAStringParameter(name) {
  return new Set([
    'content',
@@ -918,6 +1030,226 @@ function parseJSONLiteralValue(raw) {
  }
 }

+function parseLooseJSONArrayValue(raw, paramName = '') {
+  if (preservesCDATAStringParameter(paramName)) {
+    return { ok: false, value: null };
+  }
+  const s = toStringSafe(raw).trim();
+  if (!s) {
+    return { ok: false, value: null };
+  }
+  const candidate = parseLooseJSONArrayCandidate(s, paramName);
+  if (candidate.ok) {
+    return candidate;
+  }
+
+  const segments = splitTopLevelJSONValues(s);
+  if (segments.length < 2) {
+    return { ok: false, value: null };
+  }
+
+  const out = [];
+  for (const segment of segments) {
+    const parsed = parseLooseArrayElementValue(segment);
+    if (!parsed.ok) {
+      return { ok: false, value: null };
+    }
+    out.push(parsed.value);
+  }
+  return { ok: true, value: out };
+}
+
+function parseLooseJSONArrayCandidate(raw, paramName = '') {
+  const parsed = parseLooseArrayElementValue(raw);
+  if (!parsed.ok) {
+    return { ok: false, value: null };
+  }
+  return coerceArrayValue(parsed.value, paramName);
+}
+
+function parseLooseArrayElementValue(raw) {
+  const s = toStringSafe(raw).trim();
+  if (!s) {
+    return { ok: false, value: null };
+  }
+
+  const literal = parseJSONLiteralValue(s);
+  if (literal.ok) {
+    return literal;
+  }
+
+  const repairedBackslashes = repairInvalidJSONBackslashes(s);
+  if (repairedBackslashes !== s) {
+    try {
+      const parsed = JSON.parse(repairedBackslashes);
+      return { ok: true, value: parsed };
+    } catch (_err) {
+      // Fall through.
+    }
+  }
+
+  const repairedLoose = repairLooseJSON(s);
+  if (repairedLoose !== s) {
+    try {
+      const parsed = JSON.parse(repairedLoose);
+      return { ok: true, value: parsed };
+    } catch (_err) {
+      // Fall through.
+    }
+  }
+
+  if (s.includes('<') && s.includes('>')) {
+    const parsed = parseMarkupInput(s);
+    if (Array.isArray(parsed)) {
+      return { ok: true, value: parsed };
+    }
+    if (parsed && typeof parsed === 'object') {
+      return { ok: true, value: parsed };
+    }
+  }
+
+  return { ok: false, value: null };
+}
+
+function coerceArrayValue(value, paramName = '') {
+  if (Array.isArray(value)) {
+    return { ok: true, value };
+  }
+  if (!value || typeof value !== 'object') {
+    return { ok: false, value: null };
+  }
+
+  const keys = Object.keys(value);
+  if (keys.length !== 1) {
+    return { ok: false, value: null };
+  }
+
+  if (Object.prototype.hasOwnProperty.call(value, 'item')) {
+    const items = value.item;
+    const nested = coerceArrayValue(items, '');
+    return nested.ok ? nested : { ok: true, value: [items] };
+  }
+
+  if (paramName && Object.prototype.hasOwnProperty.call(value, paramName)) {
+    const nested = coerceArrayValue(value[paramName], '');
+    if (nested.ok) {
+      return nested;
+    }
+  }
+
+  return { ok: false, value: null };
+}
+
+function splitTopLevelJSONValues(raw) {
+  const s = toStringSafe(raw).trim();
+  if (!s) {
+    return [];
+  }
+
+  const values = [];
+  let start = 0;
+  let depth = 0;
+  let inString = false;
+  let escaped = false;
+
+  for (let i = 0; i < s.length; i += 1) {
+    const ch = s[i];
+    if (inString) {
+      if (escaped) {
+        escaped = false;
+        continue;
+      }
+      if (ch === '\\') {
+        escaped = true;
+        continue;
+      }
+      if (ch === '"') {
+        inString = false;
+      }
+      continue;
+    }
+    if (ch === '"') {
+      inString = true;
+      continue;
+    }
+    if (ch === '{' || ch === '[') {
+      depth += 1;
+      continue;
+    }
+    if (ch === '}' || ch === ']') {
+      if (depth > 0) {
+        depth -= 1;
+      }
+      continue;
+    }
+    if (ch === ',' && depth === 0) {
+      const segment = s.slice(start, i).trim();
+      if (!segment) {
+        return [];
+      }
+      values.push(segment);
+      start = i + 1;
+    }
+  }
+
+  const last = s.slice(start).trim();
+  if (!last) {
+    return [];
+  }
+  values.push(last);
+  return values.length > 1 ? values : [];
+}
+
+function repairInvalidJSONBackslashes(s) {
+  if (!s || !s.includes('\\')) {
+    return s;
+  }
+
+  let out = '';
+  for (let i = 0; i < s.length; i += 1) {
+    const ch = s[i];
+    if (ch !== '\\') {
+      out += ch;
+      continue;
+    }
+    if (i + 1 < s.length) {
+      const next = s[i + 1];
+      if ('"\\/bfnrt'.includes(next)) {
+        out += `\\${next}`;
+        i += 1;
+        continue;
+      }
+      if (next === 'u' && i + 5 < s.length) {
+        let isHex = true;
+        for (let j = 1; j <= 4; j += 1) {
+          const r = s[i + 1 + j];
+          if (!/[0-9a-fA-F]/.test(r)) {
+            isHex = false;
+            break;
+          }
+        }
+        if (isHex) {
+          out += `\\u${s.slice(i + 2, i + 6)}`;
+          i += 5;
+          continue;
+        }
+      }
+    }
+    out += '\\\\';
+  }
+  return out;
+}
+
+function repairLooseJSON(s) {
+  const raw = toStringSafe(s).trim();
+  if (!raw) {
+    return raw;
+  }
+  let out = raw.replace(/([{,]\s*)([a-zA-Z_][a-zA-Z0-9_]*)\s*:/g, '$1"$2":');
+  out = out.replace(/(:\s*)(\{(?:[^{}]|\{[^{}]*\})*\}(?:\s*,\s*\{(?:[^{}]|\{[^{}]*\})*\})+)/g, '$1[$2]');
+  return out;
+}
+
 function sanitizeLooseCDATA(text) {
  const raw = toStringSafe(text);
  if (!raw) {
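
Review note: the top-level splitting step behind `parseLooseJSONArrayValue` can be exercised in isolation. The sketch below restates that splitter under a hypothetical name (`splitTopLevel`) and pairs it with a simplified caller (`looseListToArray`, also hypothetical) that uses strict `JSON.parse` only. The function in the diff additionally tries backslash repair, loose-JSON repair, and markup parsing for each segment, so this is an approximation of the behavior, not the shipped code.

```javascript
// Hypothetical stand-alone restatement of the comma splitter from the diff.
// Commas inside strings or nested {} / [] do not split the input.
function splitTopLevel(raw) {
  const s = String(raw ?? '').trim();
  if (!s) return [];
  const values = [];
  let start = 0;
  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = 0; i < s.length; i += 1) {
    const ch = s[i];
    if (inString) {
      if (escaped) escaped = false;          // character after a backslash
      else if (ch === '\\') escaped = true;  // start of an escape sequence
      else if (ch === '"') inString = false; // closing quote
      continue;
    }
    if (ch === '"') { inString = true; continue; }
    if (ch === '{' || ch === '[') { depth += 1; continue; }
    if (ch === '}' || ch === ']') { if (depth > 0) depth -= 1; continue; }
    if (ch === ',' && depth === 0) {
      const segment = s.slice(start, i).trim();
      if (!segment) return [];               // empty element: reject the whole input
      values.push(segment);
      start = i + 1;
    }
  }
  const last = s.slice(start).trim();
  if (!last) return [];                      // trailing comma: reject
  values.push(last);
  return values.length > 1 ? values : [];    // a single value is not a list
}

// Simplified caller (assumption): every segment must be strict JSON.
function looseListToArray(raw) {
  const segments = splitTopLevel(raw);
  if (segments.length < 2) return { ok: false, value: null };
  const out = [];
  for (const segment of segments) {
    try {
      out.push(JSON.parse(segment));
    } catch {
      return { ok: false, value: null };
    }
  }
  return { ok: true, value: out };
}
```

With these definitions, an input such as `{"a":1}, {"b":2}` yields a two-element array, while a single bracketed value such as `[1,2]` is left to the regular literal path because the splitter reports no top-level list.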
Author	SHA1	Message	Date
CJACK	049e40e5f1	fix: drop obsolete release smoke check	2026-05-02 04:19:23 +08:00
CJACK	b3c54fcf3d	chore: bump version to 4.3.0 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-02 03:55:36 +08:00
CJACK	1c38709d32	feat: add support for parsing loose JSON lists into arrays in tool call parameters	2026-05-02 03:26:43 +08:00
CJACK	28e800c670	chore: remove obsolete planning and gate target files	2026-05-02 02:40:57 +08:00
CJACK	4389e02b29	feat: implement sync.Pool for tiktoken encoding instances to optimize token counting performance	2026-05-02 02:31:24 +08:00
CJACK	e2756f800d	feat: introduce JSON UTF-8 validation middleware and prepend output integrity guard system prompt to messages	2026-05-02 02:22:34 +08:00
CJACK	55abf64717	feat: add model type support for file uploads with automatic resolution and header propagation	2026-05-02 00:55:17 +08:00
CJACK	76ee2faa12	chore: bump version to 4.2.2 and update documentation to reflect improved release workflows, CI dependencies, and project structure	2026-05-01 23:44:07 +08:00
CJACK	0bca6e2cee	feat: implement context cancellation handling for chat and response stream runtimes to ensure clean termination without retries	2026-05-01 23:20:46 +08:00
CJACK.	934b40e572	Merge pull request #392 from wyv202011y/fix/timeout-and-context-cancel fix: increase stream timeout constants for large-context models; guar…	2026-05-01 23:17:31 +08:00
CJACK	dd5a0c5213	refactor: update and standardize current input file continuation prompt instructions	2026-05-01 22:27:59 +08:00
CJACK	43402e7a26	refactor: rename history file constant from HISTORY.txt to DS2API_HISTORY.txt across codebase and tests	2026-05-01 22:05:45 +08:00
CJACK.	6373c001f5	Merge pull request #391 from BigUncle/fix/vercel-admin-history-rewrite Fix: add missing Vercel rewrite rules for admin API routes	2026-05-01 21:45:14 +08:00
BigUncle	3430322e81	docs: add Vercel chat history read-only filesystem troubleshooting	2026-05-01 21:17:52 +08:00
CJACK	df1cfac9bc	refactor: replace history transcript format with numbered sections and rename upload file to HISTORY.txt	2026-05-01 21:15:17 +08:00
王	706e68de23	fix: increase stream timeout constants for large-context models; guard against context-cancelled double-recording - Increase StreamIdleTimeout from 90s to 300s and MaxKeepaliveCount from 10 to 40 to prevent premature stream termination with DeepSeek V4 Pro (~50K token contexts) - Add r.Context().Err() check after ConsumeSSE in empty_retry_runtime (chat + responses) to prevent historySession.error() from overwriting historySession.stopped() when the request context is cancelled References: - MaxKeepaliveCount=10 creates a 50s no-content timeout that kills the stream before DeepSeek V4 Pro can produce its first token with large contexts - Hermes Agent reports 'No response from provider for 180s' because the underlying SSE connection was already terminated by ds2api at 50s - Context cancellation path: OnContextDone -> stopped(), then finalize() with empty output -> retry -> error() overwrites stopped()	2026-05-01 21:11:36 +08:00
BigUncle	83b4c7bcad	fix: add missing Vercel rewrite rules for admin API routes /admin/chat-history, /admin/proxies, /admin/dev/raw-samples, and /admin/dev/captures were falling through to the SPA fallback (/admin/index.html), causing "Unexpected token '<'" JSON parse errors on the frontend.	2026-05-01 20:50:12 +08:00
CJACK.	445c95a4f2	Merge pull request #379 from CJackHwang/dev Merge pull request #377 from CJackHwang/codex/run-all-tests-and-fix-failures Fix failing current-input token accounting test	2026-05-01 16:12:17 +08:00
CJACK	0a6ef8e3f2	fix: remove bufio.Scanner 2MiB line limit for SSE; support quasi_status direct patch Replace bufio.Scanner with bufio.NewReaderSize + ReadBytes('\n') across all SSE read paths to preserve long single-line data (e.g. write_file content). Add quasi_status and auto_continue handling as direct path-based patches in both Go continue observer and Node vercel_stream_impl, mirroring existing batch-patch logic. Add 2MiB+ line throughput tests at every SSE layer. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-01 15:45:17 +08:00
CJACK	fd0ec29991	refactor: generalize DSML tag parsing to tolerate model noise; split tiktoken by build tags Replace hardcoded DSML typo variant lists in Go/Node tool call parsers with generalized prefix consumption that tolerates repeated leading <, repeated DSML prefix noise, and trailing pipe terminators. Split tiktoken-dependent token counting into a build-tagged file for non-cgo platform compatibility. Add /data directory to Dockerfile for bind-mount permissions. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-01 15:17:11 +08:00
CJACK	2671298439	fix: coalesce small stream deltas to prevent character swallowing; add read-tool cache guard Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-01 13:53:27 +08:00
CJACK	92e321fe2c	Fix character-swallowing issue	2026-05-01 01:31:48 +08:00
CJACK.	fca8c01397	Merge pull request #385 from ouqiting/fix_chat_history fix: content being overwritten and left empty	2026-05-01 00:21:33 +08:00
ouqiting	667e1e3710	fix: content being overwritten and left empty	2026-04-30 21:36:47 +08:00
CJACK.	4438d03c5c	Merge pull request #377 from CJackHwang/codex/run-all-tests-and-fix-failures Fix failing current-input token accounting test	2026-04-30 02:41:08 +08:00
CJACK.	95b7665643	Merge branch 'dev' into codex/run-all-tests-and-fix-failures	2026-04-30 02:39:18 +08:00
CJACK.	9896b1fc33	Merge pull request #378 from NgoQuocViet2001/ai/openai-root-route-aliases feat(openai): add root route aliases	2026-04-30 02:33:38 +08:00
CJACK.	966f21211d	Fix nil-session guard in chat history test	2026-04-30 02:31:06 +08:00
NgoQuocViet2001	7dc3af40b2	feat(openai): add root route aliases	2026-04-30 01:24:53 +07:00
CJACK.	2f6b5ffda0	Fix current-input token text test expectation	2026-04-30 02:22:17 +08:00
CJACK.	85e256ad4d	Merge pull request #375 from CJackHwang/codex/investigate-data-loss-issue-in-pr-369 sse/parser: treat object-shaped `v` as visible content, preserve INCOMPLETE across omitted status; add tests and samples	2026-04-30 02:14:26 +08:00
CJACK.	7c3ff6ee7e	Merge pull request #374 from shern-point/feat/full-context-file-token-accounting Feat/full context file token accounting	2026-04-30 02:12:55 +08:00
CJACK.	63e62fd1b0	Merge pull request #372 from shern-point/feat/accurate-context-token-length Feat/accurate context token length	2026-04-30 02:11:32 +08:00
CJACK.	483d7af3d2	Merge pull request #373 from NgoQuocViet2001/ai/ds2api-small-regression-fix fix(openai): return 400 for inline file limit	2026-04-30 02:08:22 +08:00
CJACK.	0f89823526	chore(sse): bump client version and refresh longtext stream fixtures	2026-04-30 02:05:45 +08:00
shern-point	6a778e0d35	feat: include inline-uploaded file tokens in context token accounting Track byte sizes of inline-uploaded files during PreprocessInlineFileInputs and convert them to conservative token estimates (bytes/3). RefFileTokens is threaded through StandardRequest into all OpenAI chat/responses usage builders so returned prompt_tokens/input_tokens reflect the full upstream context cost including attached files.	2026-04-30 01:42:51 +08:00
NgoQuocViet2001	9035c350a7	fix(openai): return 400 for inline file limit	2026-04-30 00:35:59 +07:00
shern-point	ba80052a26	fix: count uploaded file content in context token accounting PromptTokenText now reflects the actual downstream context cost: the uploaded IGNORE.txt file content plus the neutral live prompt, instead of only the pre-split prompt text.	2026-04-30 01:12:35 +08:00
shern-point	78fdd63470	feat: add full-context token regression coverage and docs Lock in the current_input_file regression with API-level tests and document that returned context token counts now track full prompt semantics with conservative sizing.	2026-04-30 00:46:06 +08:00
shern-point	4b4f097006	feat: use model-aware prompt counting in Gemini paths Preserve Gemini prompt token text during normalization and remove the hardcoded DeepSeek model from native Gemini usage helpers.	2026-04-30 00:46:05 +08:00
shern-point	d3018c281b	feat: use tokenizer-based counting in Claude token paths Unify Claude count_tokens, legacy stream accounting, and legacy render usage with preserved prompt text so Claude stops falling back to lossy message formatting.	2026-04-30 00:46:04 +08:00
shern-point	415a2359ad	feat: route OpenAI responses usage through preserved prompt text Use the stored full-context prompt text for responses accounting so neutral placeholder prompts do not underreport returned input token counts.	2026-04-30 00:45:31 +08:00
shern-point	f702d45a24	feat: route OpenAI chat usage through preserved prompt text Use the stored full-context prompt text for chat non-stream, stream, and retry accounting so current_input_file no longer shrinks returned prompt token counts.	2026-04-30 00:45:30 +08:00
shern-point	90817cb9e2	feat: apply tokenizer-based counting in OpenAI usage builders Move OpenAI chat and responses usage accounting onto the shared tokenizer-aware counters so prompt and output usage stay model-aware and conservatively sized.	2026-04-30 00:45:29 +08:00
shern-point	b96f736bd2	feat: preserve full prompt text across current_input_file rewrites Keep token accounting tied to the original prompt even after the live prompt is replaced with a neutral placeholder and hidden context file.	2026-04-30 00:45:01 +08:00
shern-point	8ab028c52a	feat: seed PromptTokenText during request normalization Capture the fully built prompt at normalization time for OpenAI and Gemini-compatible requests so usage paths can reuse the original context text.	2026-04-30 00:44:59 +08:00
shern-point	78366afec5	feat: add PromptTokenText to StandardRequest Track a dedicated prompt string for token accounting so later prompt rewrites can keep returning full-context counts.	2026-04-30 00:44:57 +08:00
shern-point	bd41c8a90c	feat: add tokenizer-based token counting utilities Use go-tiktoken with embedded vocabularies for accurate BPE token counting. CountPromptTokens applies conservative padding so returned context token counts stay slightly above the real value instead of undercounting.	2026-04-30 00:44:11 +08:00
CJACK.	bc2a78ae29	Merge pull request #370 from CJackHwang/codex/align-vercel-behavior-with-go fix(vercel): align JS stream parser with Go object-shaped content	2026-04-30 00:12:32 +08:00
CJACK.	192cdf8562	fix(vercel): align JS stream parser with Go object-shaped content	2026-04-29 23:56:16 +08:00
CJACK.	94c1acace5	Merge pull request #369 from CJackHwang/dev Merge pull request #368 from CJackHwang/codex/fix-review-issues-for-pr-#364 Restore thinking fallback for tool-call detection and drop history.txt wrapper tags	2026-04-29 23:42:50 +08:00
CJACK.	273c18ba0f	fix: fallback to /app config when /data is unavailable	2026-04-29 23:40:07 +08:00
CJACK.	ae28e33184	fix: preserve continue state when chunk status is missing	2026-04-29 23:25:18 +08:00
CJACK.	0438ce9a12	Merge pull request #368 from CJackHwang/codex/fix-review-issues-for-pr-#364 Restore thinking fallback for tool-call detection and drop history.txt wrapper tags	2026-04-29 23:07:36 +08:00
CJACK.	af4a067dab	Merge pull request #362 from CJackHwang/codex/fix-issue-based-on-feedback fix(sse): batch tiny stream chunks before emitting	2026-04-29 23:07:03 +08:00
CJACK.	33f6fef015	Fix tool-call fallback on sanitized empty text and remove history wrapper tags	2026-04-29 23:04:45 +08:00
CJACK.	6d3979a1d6	fix(sse): stop scanner sender when stream context cancels	2026-04-29 22:59:22 +08:00
CJACK.	c8922c7a88	Merge pull request #364 from adnxx1wsx/dev Fix stream compatibility and vision model exposure	2026-04-29 22:02:19 +08:00
MiY	241334c658	Fix stream compatibility and vision model exposure	2026-04-29 20:23:13 +08:00
CJACK.	d7e071b24a	Bump version from 4.1.3 to 4.2.0	2026-04-29 19:08:57 +08:00
CJACK.	89225c778e	fix(sse): batch tiny stream chunks before emitting	2026-04-29 18:58:54 +08:00
CJACK.	22160de2c4	Merge pull request #359 from NgoQuocViet2001/ai/ds2api-small-fix fix(openai): keep citation indexes one-based with zero-based references	2026-04-29 18:27:15 +08:00
NgoQuocViet2001	0cbc2c875d	fix(openai): keep citation indexes one-based	2026-04-29 15:43:09 +07:00
CJACK.	a0984ef682	Merge pull request #358 from CJackHwang/revert-356-codex/check-version-update-in-automation-scripts Revert "Verify GHCR latest tag matches release and show version source/latest in dashboard"	2026-04-29 14:49:41 +08:00
CJACK.	babfa973d6	Revert "Verify GHCR latest tag matches release and show version source/latest in dashboard"	2026-04-29 14:47:53 +08:00
CJACK.	ba4071d8b5	Merge pull request #357 from CJackHwang/codex/update-documentation-for-config.json-permissions Return config persistence warning when config path is read-only; default container config to /data/config.json and update docs	2026-04-29 14:18:25 +08:00
CJACK.	e1f8e493d2	fix: add legacy /app/config.json fallback for container upgrades	2026-04-29 14:12:20 +08:00
CJACK.	907104a735	Merge pull request #356 from CJackHwang/codex/check-version-update-in-automation-scripts Verify GHCR latest tag matches release and show version source/latest in dashboard	2026-04-29 13:53:42 +08:00
CJACK.	2c8409dcbb	fix docker defaults to writable /data config path and align docs	2026-04-29 13:46:22 +08:00
CJACK.	5c23261932	webui: show version source and latest release tag in sidebar	2026-04-29 13:45:33 +08:00
CJACK.	d7125ea106	Bump version from 4.1.2 to 4.1.3	2026-04-29 07:55:48 +08:00
CJACK.	929d9a8ef7	Merge pull request #352 from shern-point/fix/tool-string-schema-protection Fix/tool type schema protection	2026-04-29 07:51:21 +08:00
CJACK.	c03f733b83	Merge pull request #353 from Gingiris/docs/add-toc docs: add Table of Contents to README.MD and README.en.md	2026-04-29 07:50:54 +08:00
Gingiris	047fc9bee2	docs: add Table of Contents to README.MD and README.en.md Both READMEs are 400+ lines with 14 top-level sections and multiple subsections but have no navigation aid. Add a Table of Contents at the top of each file to help readers quickly find relevant sections. Changes: - README.MD: add 目录 section with links to all h2/h3 headings - README.en.md: add Table of Contents with matching structure	2026-04-28 12:18:37 -07:00
shern-point	52558838ef	docs: document request-scoped tool schema authority	2026-04-29 02:00:20 +08:00
shern-point	f1926a6ced	fix: normalize Vercel stream tool arguments by schema	2026-04-29 02:00:01 +08:00
shern-point	6e21714e23	test: cover Claude schema-aware tool normalization	2026-04-29 01:59:42 +08:00
shern-point	48c4f0df9f	fix: preserve runtime tool schemas in Claude tool output	2026-04-29 01:59:24 +08:00
shern-point	a550de30af	fix: expand shared tool schema extraction	2026-04-29 01:59:05 +08:00
CJACK.	23422e4a8e	Merge pull request #350 from ouqiting/fix_chat_histroy feat: parse split context files in list view	2026-04-29 01:34:10 +08:00
CJACK.	9c33bed403	Merge pull request #349 from RinZ27/fix-docker-non-root build: improve Docker robustness and fix potential security issues	2026-04-29 01:34:00 +08:00
ouqiting	c81294f1b7	fix(chat-history): support tool turns in parsed HISTORY list view	2026-04-29 01:27:14 +08:00
ouqiting	28d2b0410f	feat: parse split context files in list view	2026-04-29 01:15:29 +08:00
RinZ27	0c782407f5	build: improve Docker robustness and fix potential security issues	2026-04-28 23:49:54 +07:00
CJACK.	27eb73d48b	Merge pull request #346 from CJackHwang/dev Normalize string tool inputs and enhance schema protection	2026-04-28 22:06:41 +08:00
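CJACK.	685b5011e4	Merge pull request #343 from livesRan/fix-429Resend-pr Support converting reference citation tags to links, with compatibility for zero-based index mapping	2026-04-28 21:47:15 +08:00
songguoliang	15e9eb3639	Support converting reference citation tags to links, with compatibility for zero-based index mapping	2026-04-28 16:42:37 +08:00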
CJACK.	f18e6b9b11	Bump version from 4.1.1 to 4.1.2	2026-04-28 16:39:12 +08:00
CJACK.	40ebc8e942	Merge pull request #342 from shern-point/fix/tool-string-schema-protection Fix/tool string schema protection	2026-04-28 16:37:44 +08:00
shern-point	fa3e6d040d	docs: document schema-based string tool coercion Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-28 13:48:04 +08:00
shern-point	458e4469e5	test: cover openai formatter string protection Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-28 13:47:24 +08:00
shern-point	72c8e7e9f9	test: cover responses string-protected tool arguments Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-28 13:46:43 +08:00
shern-point	b9c8e90d98	refactor: thread tool schemas through responses tool outputs Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-28 13:46:06 +08:00
shern-point	36fcba1280	test: cover chat string-protected tool arguments Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-28 13:45:35 +08:00
shern-point	801b5abce3	refactor: thread tool schemas through chat tool outputs Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-28 13:38:57 +08:00
shern-point	206c3d5479	fix: apply string protection in shared tool formatters Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-28 13:27:41 +08:00
shern-point	b2903c35ed	fix: normalize schema-declared string tool inputs Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-28 13:23:58 +08:00
CJACK.	b26dc8b7de	Merge pull request #338 from CJackHwang/dev refactor: update tool call parsing and stream tool sieve logic	2026-04-28 01:48:10 +08:00
CJACK	63271aea8c	refactor: update tool call parsing and stream tool sieve logic Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-28 01:39:32 +08:00
CJACK.	516da04bcd	Merge pull request #337 from CJackHwang/codex/revert-current-input-file-prompt [codex] revert current_input_file prompt refactor	2026-04-28 00:35:36 +08:00
CJACK	9f7b671e5e	Revert "refactor: consolidate current_input_file prompt into BuildOpenAICurrentInputContextPrompt" This reverts commit `d40888496e`.	2026-04-28 00:31:12 +08:00
@@ -1 +1 @@
-4.1.1
+4.3.0