Merge pull request #379 from CJackHwang/dev

Merge pull request #377 from CJackHwang/codex/run-all-tests-and-fix-failures

Fix failing current-input token accounting test
fix: remove bufio.Scanner 2MiB line limit for SSE; support quasi_status direct patch
2026-05-01 23:15:27 +08:00 · 2026-05-01 16:12:17 +08:00 · 2026-05-01 15:45:17 +08:00 · 2026-05-01 15:17:11 +08:00 · 2026-05-01 13:53:27 +08:00 · 2026-05-01 01:31:48 +08:00
130 changed files with 13651 additions and 4544 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -29,6 +29,7 @@ yarn.lock
 pnpm-lock.yaml

 # Build artifacts
+dist/
 *.tsbuildinfo
 .cache/
 .parcel-cache/
--- a/API.en.md
+++ b/API.en.md
@@ -165,6 +165,8 @@ Gemini-compatible clients can also send `x-goog-api-key`, `?key=`, or `?api_key=
 | PUT | `/admin/chat-history/settings` | Admin | Update conversation history retention limit |
 | GET | `/admin/version` | Admin | Check current version and latest Release |

+OpenAI `/v1/*` paths are canonical. For clients configured with the bare DS2API service URL, the same OpenAI handlers are also exposed through root shortcuts: `/models`, `/models/{id}`, `/chat/completions`, `/responses`, `/responses/{response_id}`, `/embeddings`, and `/files`.
+
 ---

 ## Health Endpoints
@@ -199,8 +201,7 @@ No auth required. Returns the currently supported DeepSeek native model list.
    {"id": "deepseek-v4-pro", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
    {"id": "deepseek-v4-flash-search", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
    {"id": "deepseek-v4-pro-search", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
-    {"id": "deepseek-v4-vision", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
-    {"id": "deepseek-v4-vision-search", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []}
+    {"id": "deepseek-v4-vision", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []}
  ]
 }
 ```
@@ -224,6 +225,8 @@ Built-in aliases come from `internal/config/models.go`; `config.model_aliases` c
 - Gemini: `gemini-2.5-pro`, `gemini-2.5-flash`, `gemini-pro-vision`
 - Other compatibility families: `llama-*`, `qwen-*`, `mistral-*`, and `command-*` fall back through family heuristics

+Current vision support resolves only to `deepseek-v4-vision` and does not expose a separate `vision-search` variant.
+
 Retired historical families such as `claude-1.*`, `claude-2.*`, `claude-instant-*`, and `gpt-3.5*` are explicitly rejected.

 ### `POST /v1/chat/completions`
@@ -917,12 +920,15 @@ Updates proxy binding for a specific account.
  "message": "API test successful (session creation only)",
  "model": "deepseek-v4-flash",
  "session_count": 0,
-  "config_writable": true
+  "config_writable": true,
+  "config_warning": ""
 }
 ```

 If a `message` is provided, `thinking` may also be included when the upstream response carries reasoning text.

+When the configured file path is not writable (for example, read-only `/app/config.json` inside some containers), login/session testing still proceeds; `config_warning` is returned to indicate token persistence failed and the token is memory-only until restart.
+
 ### `POST /admin/accounts/test-all`

 Optional request field: `model`.
--- a/API.md
+++ b/API.md
@@ -165,6 +165,8 @@ Gemini 兼容客户端还可以使用 `x-goog-api-key`、`?key=` 或 `?api_key=`
 | PUT | `/admin/chat-history/settings` | Admin | 更新对话记录保留条数 |
 | GET | `/admin/version` | Admin | 查询当前版本与最新 Release |

+OpenAI `/v1/*` 仍是规范路径。对于只配置 DS2API 根地址的客户端，同一套 OpenAI handler 也通过根路径快捷路由暴露：`/models`、`/models/{id}`、`/chat/completions`、`/responses`、`/responses/{response_id}`、`/embeddings`、`/files`。
+
 ---

 ## 健康检查
@@ -204,9 +206,7 @@ Gemini 兼容客户端还可以使用 `x-goog-api-key`、`?key=` 或 `?api_key=`
    {"id": "deepseek-v4-pro-search", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
    {"id": "deepseek-v4-pro-search-nothinking", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
    {"id": "deepseek-v4-vision", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
-    {"id": "deepseek-v4-vision-nothinking", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
-    {"id": "deepseek-v4-vision-search", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
-    {"id": "deepseek-v4-vision-search-nothinking", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []}
+    {"id": "deepseek-v4-vision-nothinking", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []}
  ]
 }
 ```
@@ -232,6 +232,7 @@ Gemini 兼容客户端还可以使用 `x-goog-api-key`、`?key=` 或 `?api_key=`
 - 其他兼容族：`llama-*`、`qwen-*`、`mistral-*`、`command-*` 会按家族启发式回退

 上述 alias 若在请求名后追加 `-nothinking` 后缀，也会映射到对应的强制关闭 thinking 版本。
+当前视觉能力仅对应 `deepseek-v4-vision` / `deepseek-v4-vision-nothinking`，不会解析出独立的 `vision-search` 变体。

 退役历史模型（如 `claude-1.*`、`claude-2.*`、`claude-instant-*`、`gpt-3.5*`）会被显式拒绝。

@@ -934,12 +935,15 @@ data: {"type":"message_stop"}
  "message": "API 测试成功（仅会话创建）",
  "model": "deepseek-v4-flash",
  "session_count": 0,
-  "config_writable": true
+  "config_writable": true,
+  "config_warning": ""
 }
 ```

 如果传入 `message`，还会附带 `thinking`（当上游返回思考内容时）。

+当部署环境配置文件路径不可写（例如容器内默认 `/app/config.json` 只读）时，登录与会话测试仍可继续；此时会返回 `config_warning` 提示 token 仅保存在内存、重启后丢失。
+
 ### `POST /admin/accounts/test-all`

 可选请求字段：`model`
--- a/12
+++ b/12
@@ -28,6 +28,8 @@ FROM debian:bookworm-slim AS runtime-base
 WORKDIR /app
 RUN apt-get update \
    && apt-get install -y --no-install-recommends ca-certificates \
+    && groupadd -r ds2api && useradd -r -g ds2api -d /app -s /sbin/nologin ds2api \
+    && mkdir -p /app/data /data && chown -R ds2api:ds2api /app /data \
    && rm -rf /var/lib/apt/lists/*
 COPY --from=busybox-tools /bin/busybox /usr/local/bin/busybox
 EXPOSE 5001
@@ -36,8 +38,9 @@ CMD ["/usr/local/bin/ds2api"]
 FROM runtime-base AS runtime-from-source
 COPY --from=go-builder /out/ds2api /usr/local/bin/ds2api

-COPY --from=go-builder /app/config.example.json /app/config.example.json
-COPY --from=webui-builder /app/static/admin /app/static/admin
+COPY --from=go-builder --chown=ds2api:ds2api /app/config.example.json /app/config.example.json
+COPY --from=webui-builder --chown=ds2api:ds2api /app/static/admin /app/static/admin
+USER ds2api

 FROM busybox-tools AS dist-extract
 ARG TARGETARCH
@@ -60,7 +63,8 @@ RUN set -eux; \
 FROM runtime-base AS runtime-from-dist
 COPY --from=dist-extract /out/ds2api /usr/local/bin/ds2api

-COPY --from=dist-extract /out/config.example.json /app/config.example.json
-COPY --from=dist-extract /out/static/admin /app/static/admin
+COPY --from=dist-extract --chown=ds2api:ds2api /out/config.example.json /app/config.example.json
+COPY --from=dist-extract --chown=ds2api:ds2api /out/static/admin /app/static/admin
+USER ds2api

 FROM runtime-from-source AS final
--- a/README.MD
+++ b/README.MD
@@ -31,6 +31,30 @@
 >
 > 请勿将本项目用于违反服务条款、协议、法律法规或平台规则的场景。商业使用前请自行确认 `LICENSE`、相关协议以及你是否获得了作者的书面许可。

+## 目录
+
+- [架构概览（摘要）](#架构概览摘要)
+- [核心能力](#核心能力)
+- [平台兼容矩阵](#平台兼容矩阵)
+- [模型支持](#模型支持)
+  - [OpenAI 接口](#openai-接口get-v1models)
+  - [Claude 接口](#claude-接口get-anthropicv1models)
+  - [Gemini 接口](#gemini-接口)
+- [快速开始](#快速开始)
+  - [方式一：下载 Release 构建包](#方式一下载-release-构建包)
+  - [方式二：Docker 运行](#方式二docker-运行)
+  - [方式三：Vercel 部署](#方式三vercel-部署)
+  - [方式四：本地源码运行](#方式四本地源码运行)
+- [配置说明](#配置说明)
+- [鉴权模式](#鉴权模式)
+- [并发模型](#并发模型)
+- [Tool Call 适配](#tool-call-适配)
+- [本地开发抓包工具](#本地开发抓包工具)
+- [文档索引](#文档索引)
+- [测试](#测试)
+- [Release 自动构建（GitHub Actions）](#release-自动构建github-actions)
+- [免责声明](#免责声明)
+
 ## 架构概览（摘要）

 ```mermaid
@@ -107,6 +131,8 @@ flowchart LR
 | WebUI 管理台 | `/admin` 单页应用（中英文双语、深色模式，支持查看服务器端对话记录） |
 | 运维探针 | `GET /healthz`（存活）、`GET /readyz`（就绪） |

+OpenAI `/v1/*` 仍是推荐的规范路径；同时支持 `/models`、`/chat/completions`、`/responses`、`/embeddings`、`/files` 等根路径快捷路由，方便只配置 DS2API 根地址的第三方客户端。
+
 ## 平台兼容矩阵

 | 级别 | 平台 | 当前状态 |
@@ -134,10 +160,9 @@ flowchart LR
 | expert | `deepseek-v4-pro-search-nothinking` | 永久关闭，不受请求参数影响 | ✅ |
 | vision | `deepseek-v4-vision` | 默认开启，可由请求参数控制 | ❌ |
 | vision | `deepseek-v4-vision-nothinking` | 永久关闭，不受请求参数影响 | ❌ |
-| vision | `deepseek-v4-vision-search` | 默认开启，可由请求参数控制 | ✅ |
-| vision | `deepseek-v4-vision-search-nothinking` | 永久关闭，不受请求参数影响 | ✅ |

 除原生模型外，也支持常见 alias 输入（如 `gpt-4.1`、`gpt-5`、`gpt-5-codex`、`o3`、`claude-*`、`gemini-*` 等），但 `/v1/models` 返回的是规范化后的 DeepSeek 原生模型 ID。若 alias 名本身追加 `-nothinking` 后缀，也会映射到对应的强制关思考模型。完整 alias 行为以 [API.md](API.md#模型-alias-解析策略) 和 `config.example.json` 为准。
+当前上游视觉模型只暴露 `vision` 通道，不提供独立的联网搜索视觉变体。

 ### Claude 接口（`GET /anthropic/v1/models`）

@@ -221,6 +246,8 @@ docker-compose logs -f
 ```

 默认 `docker-compose.yml` 会把宿主机 `6011` 映射到容器内的 `5001`。如果你希望直接对外暴露 `5001`，请设置 `DS2API_HOST_PORT=5001`（或者手动调整 `ports` 配置）。
+同时默认把 `./config.json` 挂载到容器 `/data/config.json`，并设置 `DS2API_CONFIG_PATH=/data/config.json`，用于避免 `/app` 只读导致运行时 token 持久化失败。
+镜像会预创建 `/data` 并授权给非 root 的 `ds2api` 用户；如果使用单文件 bind mount，请确保宿主机 `config.json` 对容器用户可读写，例如 `chmod 644 config.json`。

 更新镜像：`docker-compose up -d --build`

@@ -291,7 +318,7 @@ go run ./cmd/ds2api
 - `runtime`：账号并发、队列与 token 刷新策略，可通过 Admin Settings 热更新。
 - `auto_delete.mode`：请求结束后的远端会话清理策略，支持 `none` / `single` / `all`。
 - `history_split`：旧轮次拆分字段，已废弃并忽略，仅保留兼容旧配置。
-- `current_input_file`：唯一生效的独立拆分策略；默认开启且阈值为 `0`，触发时将完整上下文合并上传为隐藏上下文文件。
+- `current_input_file`：唯一生效的独立拆分策略；默认开启且阈值为 `0`，触发时将完整上下文合并上传为 `history.txt` 上下文文件。
 - 如果关闭 `current_input_file`，请求会直接透传，不上传拆分上下文文件。
 - `thinking_injection`：默认开启；在最新 user 消息末尾追加思考增强提示词，提高高强度推理与工具调用前的思考稳定性；`prompt` 留空时使用内置默认提示词。

--- a/README.en.md
+++ b/README.en.md
@@ -28,6 +28,30 @@ Documentation entry: [Docs Index](docs/README.md) / [Architecture](docs/ARCHITEC
 >
 > Do not use this project in ways that violate service terms, agreements, laws, or platform rules. Before any commercial use, review the `LICENSE`, the relevant terms, and confirm that you have the author's written permission.

+## Table of Contents
+
+- [Architecture Overview (Summary)](#architecture-overview-summary)
+- [Key Capabilities](#key-capabilities)
+- [Platform Compatibility Matrix](#platform-compatibility-matrix)
+- [Model Support](#model-support)
+  - [OpenAI Endpoint](#openai-endpoint-get-v1models)
+  - [Claude Endpoint](#claude-endpoint-get-anthropicv1models)
+  - [Gemini Endpoint](#gemini-endpoint)
+- [Quick Start](#quick-start)
+  - [Option 1: Download Release Binaries](#option-1-download-release-binaries)
+  - [Option 2: Docker / GHCR](#option-2-docker--ghcr)
+  - [Option 3: Vercel](#option-3-vercel)
+  - [Option 4: Local Run](#option-4-local-run)
+- [Configuration](#configuration)
+- [Authentication Modes](#authentication-modes)
+- [Concurrency Model](#concurrency-model)
+- [Tool Call Adaptation](#tool-call-adaptation)
+- [Local Dev Packet Capture](#local-dev-packet-capture)
+- [Documentation Index](#documentation-index)
+- [Testing](#testing)
+- [Release Artifact Automation (GitHub Actions)](#release-artifact-automation-github-actions)
+- [Disclaimer](#disclaimer)
+
 ## Architecture Overview (Summary)

 ```mermaid
@@ -104,6 +128,8 @@ For the full module-by-module architecture and directory responsibilities, see [
 | WebUI Admin Panel | SPA at `/admin` (bilingual Chinese/English, dark mode, with server-side conversation history) |
 | Health Probes | `GET /healthz` (liveness), `GET /readyz` (readiness) |

+OpenAI `/v1/*` routes remain canonical, and DS2API also accepts root shortcuts such as `/models`, `/chat/completions`, `/responses`, `/embeddings`, and `/files` for clients configured with the bare service URL.
+
 ## Platform Compatibility Matrix

 | Tier | Platform | Status |
@@ -126,9 +152,9 @@ For the full module-by-module architecture and directory responsibilities, see [
 | default | `deepseek-v4-flash-search` | enabled by default, request-controlled | ✅ |
 | expert | `deepseek-v4-pro-search` | enabled by default, request-controlled | ✅ |
 | vision | `deepseek-v4-vision` | enabled by default, request-controlled | ❌ |
-| vision | `deepseek-v4-vision-search` | enabled by default, request-controlled | ✅ |

 Besides native IDs, DS2API also accepts common aliases as input (for example `gpt-4.1`, `gpt-5`, `gpt-5-codex`, `o3`, `claude-*`, `gemini-*`), but `/v1/models` returns normalized DeepSeek native model IDs. The complete alias behavior is documented in [API.en.md](API.en.md#model-alias-resolution) and `config.example.json`.
+Current upstream vision support exposes only the `vision` lane and does not provide a separate search-enabled vision variant.

 ### Claude Endpoint (`GET /anthropic/v1/models`)

@@ -209,6 +235,7 @@ docker-compose up -d
 ```

 The default `docker-compose.yml` uses `ghcr.io/cjackhwang/ds2api:latest` and maps host port `6011` to container port `5001`. If you want `5001` exposed directly, set `DS2API_HOST_PORT=5001` (or adjust the `ports` mapping).
+It also mounts `./config.json` to `/data/config.json` and sets `DS2API_CONFIG_PATH=/data/config.json` by default, which avoids runtime token persistence failures caused by read-only `/app`.

 Rebuild after updates: `docker-compose up -d --build`

@@ -279,7 +306,7 @@ Common fields:
 - `runtime`: account concurrency, queueing, and token refresh behavior, hot-reloadable via Admin Settings.
 - `auto_delete.mode`: remote session cleanup after each request, supporting `none` / `single` / `all`.
 - `history_split`: legacy multi-turn history split field, now ignored and kept only for backward-compatible config loading.
-- `current_input_file`: the only active split mode; it is enabled by default and uploads the full context as a hidden context file once the character threshold is reached.
+- `current_input_file`: the only active split mode; it is enabled by default and uploads the full context as a `history.txt` context file once the character threshold is reached.
 - If you turn off `current_input_file`, requests pass through directly without uploading any split context file.

 For the full environment variable list, see [docs/DEPLOY.en.md](docs/DEPLOY.en.md). For auth behavior, see [API.en.md](API.en.md#authentication).
--- a/2
+++ b/2
@@ -1 +1 @@
-4.1.2
+4.2.1
--- a/cmd/ds2api/main.go
+++ b/cmd/ds2api/main.go
@@ -35,8 +35,9 @@ func main() {
 	}

 	srv := &http.Server{
-		Addr:    "0.0.0.0:" + port,
-		Handler: app.Router,
+		Addr:              "0.0.0.0:" + port,
+		Handler:           app.Router,
+		ReadHeaderTimeout: 5 * time.Second,
 	}
 	localURL := fmt.Sprintf("http://127.0.0.1:%s", port)
 	lanIP := detectLANIPv4()
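
Review note on the `cmd/ds2api/main.go` hunk: `ReadHeaderTimeout` is the `net/http` field that bounds how long a client may take to send its request headers, which is the usual guard against slow-header connections. The sketch below isolates that one setting so it can be checked on its own. The helper name `newServer` and its parameters are illustrative assumptions, not code from this PR.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newServer is a hypothetical helper that mirrors the server literal in the hunk.
// Only the ReadHeaderTimeout value (5 seconds) is taken from the diff.
func newServer(port string, h http.Handler) *http.Server {
	return &http.Server{
		Addr:    "0.0.0.0:" + port,
		Handler: h,
		// Upper bound on the time allowed to read the request headers.
		// The request body is not covered by this timeout.
		ReadHeaderTimeout: 5 * time.Second,
	}
}

func main() {
	srv := newServer("5001", http.NewServeMux())
	fmt.Println(srv.Addr, srv.ReadHeaderTimeout)
	// A real program would now call srv.ListenAndServe().
}
```

Because only the header phase is limited, long-running streaming responses should not be cut off by this setting.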
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -9,8 +9,9 @@ services:
      # Host port is configurable via DS2API_HOST_PORT; container port stays fixed at 5001.
      - "${DS2API_HOST_PORT:-6011}:5001"
    volumes:
-      - ./config.json:/app/config.json    # 配置文件
+      - ./config.json:/data/config.json   # 配置文件（持久化推荐路径）
    environment:
      - TZ=Asia/Shanghai
      - LOG_LEVEL=INFO
      - DS2API_ADMIN_KEY=${DS2API_ADMIN_KEY:-ds2api}
+      - DS2API_CONFIG_PATH=/data/config.json
--- a/docs/DEPLOY.en.md
+++ b/docs/DEPLOY.en.md
@@ -130,6 +130,9 @@ docker-compose logs -f
 ```

 The default `docker-compose.yml` directly uses `ghcr.io/cjackhwang/ds2api:latest` and maps host port `6011` to container port `5001`. If you want `5001` exposed directly, set `DS2API_HOST_PORT=5001` (or adjust the `ports` mapping).
+The compose template also defaults to `DS2API_CONFIG_PATH=/data/config.json` with `./config.json:/data/config.json` mounted, so deployments avoid read-only `/app` persistence issues by default.
+The image pre-creates `/data` and grants it to the non-root `ds2api` user. If you bind-mount a single host file, make sure `config.json` is readable/writable by the container user, for example with `chmod 644 config.json`; otherwise Linux UID/GID mismatches can still cause `open /data/config.json: permission denied`.
+Compatibility note: when `DS2API_CONFIG_PATH` is unset and runtime base dir is `/app`, newer versions prefer `/data/config.json`; if that file is missing but legacy `/app/config.json` exists, DS2API automatically falls back to the legacy path to avoid post-upgrade config loss.

 If you want a pinned version instead of `latest`, you can also pull a specific tag directly:

@@ -195,6 +198,11 @@ Notes:

 - **Port**: DS2API listens on `5001` by default; the template sets `PORT=5001`.
 - **Persistent config**: the template mounts `/data` and sets `DS2API_CONFIG_PATH=/data/config.json`. After importing config in Admin UI, it will be written and persisted to this path.
+- **`open /app/config.json: permission denied`**: this means the instance is trying to persist runtime tokens to a read-only path (commonly `/app` inside the image).  
+  Recommended handling:
+  1. Set a writable path explicitly: `DS2API_CONFIG_PATH=/data/config.json` (and mount a persistent volume at `/data`);
+  2. If you bootstrap with `DS2API_CONFIG_JSON` and do not need runtime writeback, keep env-backed mode (`DS2API_ENV_WRITEBACK` disabled);
+  3. In current versions, login/session tests continue even if persistence fails; Admin API returns a warning that token persistence failed and token is memory-only until restart.
 - **Build version**: Zeabur / regular `docker build` does not require `BUILD_VERSION` by default. The image prefers that build arg when provided, and automatically falls back to the repo-root `VERSION` file when it is absent.
 - **First login**: after deployment, open `/admin` and login with `DS2API_ADMIN_KEY` shown in Zeabur env/template instructions (recommended: rotate to a strong secret after first login).
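
Review note on the compatibility paragraph added to `docs/DEPLOY.en.md`: the documented lookup order can be sketched as a small pure function. This is a minimal sketch under stated assumptions, not the project's implementation. The name `resolveConfigPath`, the injected `exists` callback, and the behaviour for base directories other than `/app` are all hypothetical.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// resolveConfigPath follows the order described in the deploy notes:
//  1. an explicit DS2API_CONFIG_PATH value wins;
//  2. with base dir /app, prefer /data/config.json;
//  3. if that file is missing but legacy /app/config.json exists, use the legacy path.
//
// exists is injected so the logic can be exercised without touching the disk.
func resolveConfigPath(envPath, baseDir string, exists func(string) bool) string {
	if envPath != "" {
		return envPath
	}
	if baseDir != "/app" {
		// Assumption: outside /app the config sits next to the binary's base dir.
		return filepath.Join(baseDir, "config.json")
	}
	const preferred = "/data/config.json"
	const legacy = "/app/config.json"
	if !exists(preferred) && exists(legacy) {
		return legacy
	}
	return preferred
}

func main() {
	onlyLegacy := func(p string) bool { return p == "/app/config.json" }
	fmt.Println(resolveConfigPath("", "/app", onlyLegacy))
}
```

Injecting `exists` keeps the fallback testable and makes the "preferred path wins when both files exist" rule explicit.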

--- a/docs/DEPLOY.md
+++ b/docs/DEPLOY.md
@@ -130,6 +130,9 @@ docker-compose logs -f
 ```

 默认 `docker-compose.yml` 直接使用 `ghcr.io/cjackhwang/ds2api:latest`，并把宿主机 `6011` 映射到容器内的 `5001`。如果你希望直接对外暴露 `5001`，请设置 `DS2API_HOST_PORT=5001`（或者手动调整 `ports` 配置）。
+Compose 模板还会默认设置 `DS2API_CONFIG_PATH=/data/config.json` 并挂载 `./config.json:/data/config.json`，优先避免 `/app` 只读带来的配置持久化问题。
+镜像内会预创建 `/data` 并授权给非 root 的 `ds2api` 用户；如果你使用 bind mount 单文件，请确保宿主机 `config.json` 至少可被容器用户读取/写入，例如 `chmod 644 config.json`，否则 Linux UID/GID 不一致时仍可能出现 `open /data/config.json: permission denied`。
+兼容说明：若未设置 `DS2API_CONFIG_PATH` 且运行目录是 `/app`，新版本会优先使用 `/data/config.json`；当该文件不存在但检测到历史 `/app/config.json` 时，会自动回退读取旧路径，避免升级后“配置丢失”。

 如需固定版本，也可以直接拉取指定 tag：

@@ -195,6 +198,11 @@ healthcheck:

 - **端口**：服务默认监听 `5001`，模板会固定设置 `PORT=5001`。
 - **配置持久化**：模板挂载卷 `/data`，并设置 `DS2API_CONFIG_PATH=/data/config.json`；在管理台导入配置后，会写入并持久化到该路径。
+- **`open /app/config.json: permission denied`**：说明当前实例在尝试把运行时 token 持久化到只读路径（常见于镜像内 `/app`）。  
+  处理建议：
+  1. 显式设置可写路径：`DS2API_CONFIG_PATH=/data/config.json`（并挂载持久卷到 `/data`）；  
+  2. 若你使用 `DS2API_CONFIG_JSON` 启动且不需要运行时落盘，可保持环境变量模式（`DS2API_ENV_WRITEBACK` 关闭）；  
+  3. 最新版本中，即使持久化失败，登录/会话测试仍会继续，仅提示“token 未持久化（重启后丢失）”。
 - **构建版本号**：Zeabur / 普通 `docker build` 默认不需要传 `BUILD_VERSION`；镜像会优先使用该构建参数，未提供时自动回退到仓库根目录的 `VERSION` 文件。
 - **首次登录**：部署完成后访问 `/admin`，使用 Zeabur 环境变量/模板指引中的 `DS2API_ADMIN_KEY` 登录（建议首次登录后自行更换为强密码）。

--- a/docs/DeepSeekSSE行为结构说明-2026-04-05.md
+++ b/docs/DeepSeekSSE行为结构说明-2026-04-05.md
@@ -309,7 +309,18 @@ parse SSE block
 - 新模型可能增加新的 `p` 路径。
 - 新版本可能增加新的 fragment.type。
 - `CONTENT_FILTER` 的终态模板内容可能变化。
-- 自动续写相关状态（如 `INCOMPLETE` / `AUTO_CONTINUE`）当前主要来自实测与实现兼容逻辑，后续字段形态仍可能变化。
+- 自动续写相关状态（如 `INCOMPLETE` / `AUTO_CONTINUE`）当前主要来自实测与实现兼容逻辑，后续字段形态仍可能变化。当前实现不会仅因早期 `WIP` 状态就自动继续；只有显式 `INCOMPLETE` 或 `auto_continue` 信号才会触发 continue。
 - 解析器应当对未知字段、未知路径、未知事件保持容忍。

 如果你要把这份说明用于实际开发，建议同时保留原始流样本、回放脚本和回归测试，不要只依赖本文。
+
+## 2026-04-29 最近线上样本增量观察
+
+基于 `longtext-deepseek-v4-flash-20260429` 与 `longtext-deepseek-v4-pro-20260429` 两个真实账号长文本样本，近期格式变化要点如下：
+
+1. `data:` 事件中仍大量出现 `{"v":"..."}` 的无路径增量（`p` 缺失），解析器必须把空路径视为可见正文候选，而不能只依赖 `response/content`。
+2. 对象形态 `v`（如 `{"text":"..."}` / `{"content":"..."}`）仍会出现，且可能与无路径 chunk 混用；仅按字符串处理会导致正文丢块。
+3. 多轮 continuation 场景下，后续 chunk 可能不再重复显式 `status`，状态机需要保留上一轮 `INCOMPLETE` 语义直到出现终态。
+4. 2026-04-29 起客户端头部版本基线上调到 `x-client-version: 2.0.3`，否则部分账号会出现上游行为不一致（包括空输出与补轮异常）。
+
+建议：新增样本默认回放应优先覆盖「长文本 + 多轮 + 无路径 chunk」组合，避免只用短样本导致回归漏检。
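
Review note on observations 1 and 2 in the section added above: a chunk may arrive without `p`, and `v` may be either a string or an object carrying `text` / `content`. The sketch below shows one way to accept both shapes. The function name `visibleText` and the rule that any non-empty path other than `response/content` is not visible text are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// chunk keeps v as raw JSON so its shape can be inspected afterwards.
type chunk struct {
	P string          `json:"p"`
	V json.RawMessage `json:"v"`
}

// visibleText returns the visible text carried by one data payload, if any.
func visibleText(data []byte) (string, bool) {
	var c chunk
	if err := json.Unmarshal(data, &c); err != nil {
		return "", false
	}
	if len(c.V) == 0 || string(c.V) == "null" {
		return "", false
	}
	// Assumption: an empty path is a content candidate; other paths are skipped.
	if c.P != "" && c.P != "response/content" {
		return "", false
	}
	var s string
	if json.Unmarshal(c.V, &s) == nil {
		return s, true // string-shaped v
	}
	var obj struct {
		Text    *string `json:"text"`
		Content *string `json:"content"`
	}
	if json.Unmarshal(c.V, &obj) == nil {
		if obj.Text != nil {
			return *obj.Text, true
		}
		if obj.Content != nil {
			return *obj.Content, true
		}
	}
	return "", false
}

func main() {
	for _, raw := range []string{`{"v":"Hel"}`, `{"v":{"text":"lo"}}`} {
		if s, ok := visibleText([]byte(raw)); ok {
			fmt.Print(s)
		}
	}
	fmt.Println()
}
```

Status tracking across continuation rounds (observation 3) is deliberately left out of this sketch.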
--- a/docs/prompt-compatibility.md
+++ b/docs/prompt-compatibility.md
@@ -98,13 +98,16 @@ DS2API 当前的核心思路，不是把客户端传来的 `messages`、`tools`
 - `prompt` 才是对话上下文主载体。
 - `ref_file_ids` 只承载文件引用，不承载普通文本消息。
 - `tools` 不会作为“原生工具 schema”直接下发给下游，而是被改写进 `prompt`。
+- 对外返回给客户端的 `prompt_tokens` / `input_tokens` / `promptTokenCount` 不再按“最后一条消息”或字符粗估近似返回，而是基于**完整上下文 prompt**做 tokenizer 计数；为了避免上下文实际超限但客户端误以为还能塞下，请求侧上下文 token 会额外保守上浮一点，宁可略大也不低估。
 - 当前 `/v1/chat/completions` 业务路径仍是“每次请求新建一个远端 `chat_session_id`，并默认发送 `parent_message_id: null`”；因此 DS2API 对外默认表现为“新会话 + prompt 拼历史”，而不是复用 DeepSeek 原生会话树。
 - 但 DeepSeek 远端本身支持同一 `chat_session_id` 的跨轮次持续对话。2026-04-27 已用项目内现有 DeepSeek client 做过一次不改业务代码的双轮实测：同一 `chat_session_id` 下，第 1 轮返回 `request_message_id=1` / `response_message_id=2` / 文本 `SESSION_TEST_ONE`；第 2 轮重新获取一次 PoW，并发送 `parent_message_id=2` 后，成功返回 `request_message_id=3` / `response_message_id=4` / 文本 `SESSION_TEST_TWO`。这说明“同远端会话持续聊天”能力存在，且每轮需要携带正确的 parent/message 链接信息，同时重新获取对应轮次可用的 PoW。
 - OpenAI Chat / Responses 原生走统一 OpenAI 标准化与 DeepSeek payload 组装；Claude / Gemini 会尽量复用 OpenAI prompt/tool 语义，其中 Gemini 直接复用 `promptcompat.BuildOpenAIPromptForAdapter`，Claude 消息接口在可代理场景会转换为 OpenAI chat 形态再执行。
 - 客户端传入的 thinking / reasoning 开关会被归一到下游 `thinking_enabled`。Gemini `generationConfig.thinkingConfig.thinkingBudget` 会翻译成同一套 thinking 开关；关闭时即使上游返回 `response/thinking_content`，兼容层也不会把它当作可见正文输出。若最终解析出的模型名带 `-nothinking` 后缀，则会无条件强制关闭 thinking，优先级高于请求体中的 `thinking` / `reasoning` / `reasoning_effort`。Claude surface 在流式请求且未显式声明 `thinking` 时，仍按 Anthropic 语义默认关闭；但在非流式代理场景，兼容层会内部开启一次下游 thinking，用于捕获“正文为空、工具调用落在 thinking 里”的情况，随后在回包前剥离用户不可见的 thinking block。
-- 对 OpenAI Chat / Responses 的非流式收尾，如果最终可见正文为空，兼容层会优先尝试把思维链中的独立 DSML / XML 工具块当作真实工具调用解析出来。流式链路也会在收尾阶段做同样的 fallback 检测，但不会因为思维链内容去中途拦截或改写流式输出；thinking / reasoning 增量仍按原样先发，只有在结束收尾时才可能补发最终工具调用结果。补发结果会作为本轮 assistant 的结构化 `tool_calls` / `function_call` 输出返回，而不是塞进 `content` 文本；如果客户端没有开启 thinking / reasoning，思维链只用于检测，不会作为 `reasoning_content` 或可见正文暴露。只有正文为空且思维链里也没有可执行工具调用时，才继续按空回复错误处理。
+- 对 OpenAI Chat / Responses 的非流式收尾，如果最终可见正文为空，兼容层会优先尝试把思维链中的独立 DSML / XML 工具块当作真实工具调用解析出来。流式链路也会在收尾阶段做同样的 fallback 检测，但不会因为思维链内容去中途拦截或改写流式输出；真正的工具识别始终基于原始上游文本，而不是基于“已经做过可见输出清洗”的版本，因此即使最终可见层会剥离完整 leaked DSML / XML `tool_calls` wrapper、并抑制全空参数或无效 wrapper 块，也不会影响真实工具调用转成结构化 `tool_calls` / `function_call`。补发结果会作为本轮 assistant 的结构化 `tool_calls` / `function_call` 输出返回，而不是塞进 `content` 文本；如果客户端没有开启 thinking / reasoning，思维链只用于检测，不会作为 `reasoning_content` 或可见正文暴露。只有正文为空且思维链里也没有可执行工具调用时，才继续按空回复错误处理。
 - OpenAI Chat / Responses 的空回复错误处理之前会默认做一次内部补偿重试：第一次上游完整结束后，如果最终可见正文为空、没有解析到工具调用、也没有已经向客户端流式发出工具调用，并且终止原因不是 `content_filter`，兼容层会复用同一个 `chat_session_id`、账号、token 与工具策略，把原始 completion `prompt` 追加固定后缀 `Previous reply had no visible output. Please regenerate the visible final answer or tool call now.` 后重新提交一次。重试遵循 DeepSeek 多轮对话协议：从第一次上游 SSE 流中提取 `response_message_id`，并在重试 payload 中设置 `parent_message_id` 为该值，使重试成为同一会话的后续轮次而非断裂的根消息；同时重新获取一次 PoW（若 PoW 获取失败则回退到原始 PoW）。该重试不会重新标准化消息、不会新建 session、不会切换账号，也不会向流式客户端插入重试标记；第二次 thinking / reasoning 会按正常增量直接接到第一次之后，并继续使用 overlap trim 去重。若第二次仍为空，终端错误码仍保持现有 `upstream_empty_output`；若任一尝试触发空 `content_filter`，不做补偿重试并保持 `content_filter` 错误。JS Vercel 运行时同样设置 `parent_message_id`，但因无法直接调用 PoW API 而复用原始 PoW。

+- OpenAI Chat / Responses 在最终可见正文渲染阶段，会把 DeepSeek 搜索返回中的 `[citation:N]` / `[reference:N]` 标记替换成对应 Markdown 链接。`citation` 标记按一基序号解析；`reference` 标记只有在同一段正文中出现 `[reference:0]`（允许冒号后有空格）时才按零基序号映射，并且不会影响同段正文里的 `citation` 标记。
+
 ## 5. prompt 是怎么拼出来的

 OpenAI Chat / Responses 在标准化后、current input file 之前，会默认执行 `thinking_injection` 增强。它参考 DeepSeek V4 “把控制指令放在 user 消息末尾更稳定”的用法，在最新 user message 后追加思考增强提示词。当前内置默认提示词以 `Reasoning Effort: Absolute maximum with no shortcuts permitted.` 开头，并继续要求模型充分分解问题、覆盖潜在路径与边界条件、把完整推演过程显式写出。该开关默认启用，可通过 `thinking_injection.enabled=false` 关闭；也可以通过 `thinking_injection.prompt` 自定义提示词，留空时使用内置默认提示词。
@@ -153,9 +156,12 @@ OpenAI Chat / Responses 在标准化后、current input file 之前，会默认
 工具调用正例现在优先示范官方 DSML 风格：`<|DSML|tool_calls>` → `<|DSML|invoke name="...">` → `<|DSML|parameter name="...">`。
 兼容层仍接受旧式纯 `<tool_calls>` wrapper，但提示词会优先要求模型输出官方 DSML 标签，并强调不能只输出 closing wrapper 而漏掉 opening tag。需要注意：这是“兼容 DSML 外壳，内部仍以 XML 解析语义为准”，不是原生 DSML 全链路实现；DSML 标签会在解析入口归一化回现有 XML 标签后继续走同一套 parser。
 数组参数使用 `<item>...</item>` 子节点表示；当某个参数体只包含 item 子节点时，Go / Node 解析器会把它还原成数组，避免 `questions` / `options` 这类 schema 中要求 array 的参数被误解析成 `{ "item": ... }` 对象。若模型把完整结构化 XML fragment 误包进 CDATA，兼容层会在保护 `content` / `command` 等原文字段的前提下，尝试把非原文字段中的 CDATA XML fragment 还原成 object / array。不过，如果 CDATA 只是单个平面的 XML/HTML 标签，例如 `<b>urgent</b>` 这种行内标记，兼容层会保留原始字符串，不会强行升成 object / array；只有明显表示结构的 CDATA 片段，例如多兄弟节点、嵌套子节点或 `item` 列表，才会触发结构化恢复。
+Go 侧读取 DeepSeek SSE 时不再依赖 `bufio.Scanner` 的固定 2MiB 单行上限；当写文件类工具把很长的 `content` 放在单个 `data:` 行里返回时，非流式收集、流式解析和 auto-continue 透传都会保留完整行，再进入同一套工具解析与序列化流程。
 在 assistant 最终回包阶段，如果某个 tool 参数在声明 schema 中明确是 `string`，兼容层会在把解析后的 `tool_calls` / `function_call` 重新序列化成 OpenAI / Responses / Claude 可见参数前，递归把该路径上的 number / bool / object / array 统一转成字符串；其中 object / array 会压成紧凑 JSON 字符串。这个保护只对 schema 明确声明为 string 的路径生效，不会改写本来就是 `number` / `boolean` / `object` / `array` 的参数。这样可以兼容 DeepSeek 输出了结构化片段、但上游客户端工具 schema 又严格要求字符串参数的场景（例如 `content`、`prompt`、`path`、`taskId` 等）。
+工具 schema 的权威来源始终是**当前请求实际携带的 schema**，而不是同名工具在其他 runtime（Claude Code / OpenCode / Codex 等）里的默认印象。兼容层现在会同时兼容 OpenAI 风格 `function.parameters`、直接工具对象上的 `parameters` / `input_schema`、以及 camelCase 的 `inputSchema` / `schema`，并在最终输出阶段按这份请求内 schema 决定是保留 array/object，还是仅对明确声明为 `string` 的路径做字符串化。该规则同样适用于 Claude 的流式收尾和 Vercel Node 流式 tool-call formatter，避免不同 runtime 因 schema shape 差异而出现同名工具参数类型漂移。
 正例中的工具名只会来自当前请求实际声明的工具；如果当前请求没有足够的已知工具形态，就省略对应的单工具、多工具或嵌套示例，避免把不可用工具名写进 prompt。
 对执行类工具，脚本内容必须进入执行参数本身：`Bash` / `execute_command` 使用 `command`，`exec_command` 使用 `cmd`；不要把脚本示范成 `path` / `content` 文件写入参数。
+如果当前请求声明了 `Read` / `read_file` 这类读取工具，兼容层会额外注入一条 read-tool cache guard：当读取结果只表示“文件未变更 / 已在历史中 / 请引用先前上下文 / 没有正文内容”时，模型必须把它视为内容不可用，不能反复调用同一个无正文读取；应改为请求完整正文读取能力，或向用户说明需要重新提供文件内容。这个约束只缓解客户端缓存返回空内容导致的死循环，DS2API 不会也无法凭空恢复客户端本地文件正文。

 OpenAI 路径实现：
 [internal/promptcompat/tool_prompt.go](../internal/promptcompat/tool_prompt.go)
@@ -243,9 +249,10 @@ OpenAI 文件相关实现：

 兼容层现在只保留 `current_input_file` 这一种拆分方式；旧的 `history_split` 已废弃，只保留为兼容旧配置的字段，不再参与请求处理。

-- `current_input_file` 默认开启；它用于把“完整上下文”合并进隐藏上下文文件。当最新 user turn 的纯文本长度达到 `current_input_file.min_chars`（默认 `0`）时，兼容层会上传一个文件名为 `IGNORE.txt` 的上下文文件，并在 live prompt 中只保留一个中性的 user 消息要求模型直接回答最新请求，不再暴露文件名或要求模型读取本地文件。
+- `current_input_file` 默认开启；它用于把“完整上下文”合并进 `history.txt` 上下文文件。当最新 user turn 的纯文本长度达到 `current_input_file.min_chars`（默认 `0`）时，兼容层会上传一个文件名为 `history.txt` 的上下文文件，并在 live prompt 中只保留一个中性的 user 消息要求模型直接回答最新请求，不再暴露文件名或要求模型读取本地文件。
 - 如果 `current_input_file.enabled=false`，请求会直接透传，不上传任何拆分上下文文件。
 - 旧的 `history_split.enabled` / `history_split.trigger_after_turns` 会被读取进配置对象以保持兼容，但不会触发拆分上传，也不会影响 `current_input_file` 的默认开启。
+- 即使触发 `current_input_file` 后 live prompt 被缩短，对客户端回包里的上下文 token 统计，仍会沿用**拆分前的完整 prompt 语义**做计数，而不是按缩短后的占位 prompt 计算；否则会把真实上下文显著算小。

 相关实现：

@@ -256,16 +263,11 @@ OpenAI 文件相关实现：
 - 旧历史拆分兼容壳：
  [internal/httpapi/openai/history/history_split.go](../internal/httpapi/openai/history/history_split.go)

-当前输入转文件启用并触发时，上传文件的真实文件名是 `IGNORE.txt`，文件内容是完整 `messages` 上下文；它仍会先用 OpenAI 消息标准化和 DeepSeek 角色标记序列化，再包进 `IGNORE` 文件边界里：
+当前输入转文件启用并触发时，上传文件的真实文件名是 `history.txt`，文件内容是完整 `messages` 上下文；它仍会先用 OpenAI 消息标准化和 DeepSeek 角色标记序列化，并直接作为 `history.txt` 的纯文本内容上传（不再注入文件边界标签）：

 ```text
-[uploaded filename]: IGNORE.txt
-[file content end]
-
+[uploaded filename]: history.txt
 <｜begin▁of▁sentence｜><｜System｜>...<｜User｜>...<｜Assistant｜>...<｜Tool｜>...<｜User｜>...
-
-[file name]: IGNORE
-[file content begin]
 ```

 开启后，请求的 live prompt 不再直接内联完整上下文，而是保留一个 user role 的短提示，提示模型基于已提供上下文直接回答最新请求；上传后的 `file_id` 会进入 `ref_file_ids`。
@@ -332,7 +334,7 @@ OpenAI 文件相关实现：

 - 大部分结构化语义被压进 `prompt`
 - 文件保持文件
-- 需要时把完整上下文拆进隐藏上下文文件
+- 需要时把完整上下文拆进 `history.txt` 上下文文件

 ## 12. 修改时必须同步本文档的场景

@@ -345,7 +347,7 @@ OpenAI 文件相关实现：
 - tool result 注入方式变更
 - tool prompt 模板或 tool_choice 约束变更
 - inline 文件上传 / 文件引用收集规则变更
-- current input file 触发条件、上传格式、`IGNORE` 包装格式变更
+- current input file 触发条件、上传格式、`history.txt` 包装格式变更
 - 旧 `history_split` 兼容逻辑的读取、忽略或退化行为变更
 - completion payload 字段语义变更
 - Claude / Gemini 对这套统一语义的复用关系变更
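
Review note on the SSE reading paragraph added to `docs/prompt-compatibility.md`: the change drops `bufio.Scanner` so that a very long single `data:` line is kept whole. The sketch below shows the general technique with `bufio.Reader.ReadString`, which grows its result as needed instead of failing at a fixed token size. The function names are hypothetical and the project's actual reader may be structured differently.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"io"
	"strings"
)

// readDataLines returns the payload of every "data:" line, whatever its length.
func readDataLines(r io.Reader) ([]string, error) {
	br := bufio.NewReader(r)
	var out []string
	for {
		// ReadString returns the bytes read so far even when it also returns an error.
		line, err := br.ReadString('\n')
		line = strings.TrimRight(line, "\r\n")
		if strings.HasPrefix(line, "data:") {
			payload := strings.TrimPrefix(line, "data:")
			// SSE allows one optional space after the colon.
			out = append(out, strings.TrimPrefix(payload, " "))
		}
		if errors.Is(err, io.EOF) {
			return out, nil
		}
		if err != nil {
			return out, err
		}
	}
}

// parseStream is a convenience wrapper for in-memory input.
func parseStream(s string) ([]string, error) {
	return readDataLines(strings.NewReader(s))
}

func main() {
	long := strings.Repeat("x", 3<<20) // a 3 MiB payload on a single line
	lines, err := parseStream("data: " + long + "\n\ndata: [DONE]\n")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(lines), len(lines[0]))
}
```

The trade-off is that line length is now unbounded, so a production reader may still want its own, much larger, safety limit.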
--- a/docs/toolcall-semantics.md
+++ b/docs/toolcall-semantics.md
@@ -26,7 +26,7 @@
 </tool_calls>
 ```

-这不是原生 DSML 全链路实现。DSML 只作为 prompt 外壳和解析入口别名；进入 parser 前会被归一化成 `<tool_calls>` / `<invoke>` / `<parameter>`，内部仍以现有 XML 解析语义为准。
+这不是原生 DSML 全链路实现。DSML 主要用于让模型有意识地输出协议标识，隔离普通 XML 语义；进入 parser 前会按固定本地标签名归一化成 `<tool_calls>` / `<invoke>` / `<parameter>`，内部仍以现有 XML 解析语义为准。

 约束：

@@ -39,7 +39,8 @@
 兼容修复：

 - 如果模型漏掉 opening wrapper，但后面仍输出了一个或多个 invoke 并以 closing wrapper 收尾，Go 解析链路会在解析前补回缺失的 opening wrapper。
-- 如果模型把 DSML 标签里的分隔符 `|` 写漏成空格（例如 `<|DSML tool_calls>` / `<|DSML invoke>` / `<|DSML parameter>`，或无 leading pipe 的 `<DSML tool_calls>` 形态），或把 `DSML` 与工具标签名直接黏连（例如 `<DSMLtool_calls>` / `<DSMLinvoke>` / `<DSMLparameter>`），或把最前面的 pipe 误写成全宽竖线（例如 `<｜DSML|tool_calls>` / `<｜DSML|invoke>` / `<｜DSML|parameter>`），Go / Node 会在固定工具标签名范围内归一化；相似但非工具标签名（如 `tool_calls_extra`）仍按普通文本处理。
+- Go / Node 解析层不再枚举每一种 DSML typo。它会把工具标签名前的 `DSML`、管道符 `|` / `｜`、空白、重复 leading `<` 视为可容忍的协议噪声，然后只匹配固定本地标签名 `tool_calls` / `invoke` / `parameter`。例如 `<DSML|tool_calls>`、`<<|DSML|tool_calls>`、`<|DSML tool_calls>`、`<DSMLtool_calls>`、`<<DSML|DSML|tool_calls>` 都会归一化；相似但非固定标签名（如 `tool_calls_extra`）仍按普通文本处理。
+- 如果模型在固定工具标签名后多输出一个尾部管道符，例如 `<|DSML|tool_calls|` / `<|DSML|invoke|` / `<|DSML|parameter|`，兼容层会把这个尾部 `|` 当作异常标签终止符并补齐缺失的 `>`；如果后面已经有 `>`，也会消费这个多余 `|` 后再归一化。
 - 这是一个针对常见模型失误的窄修复，不改变推荐输出格式；prompt 仍要求模型直接输出完整 DSML 外壳。
 - 裸 `<invoke ...>` / `<parameter ...>` 不会被当成“已支持的工具语法”；只有 `tool_calls` wrapper 或可修复的缺失 opening wrapper 才会进入工具调用路径。

@@ -53,7 +54,7 @@

 在流式链路中（Go / Node 一致）：

-- DSML `<|DSML|tool_calls>` wrapper、兼容变体（`<dsml|tool_calls>`、`<｜tool_calls>`、`<|tool_calls>`、`<｜DSML|tool_calls>`）、窄容错空格分隔形态（如 `<|DSML tool_calls>`）、黏连形态（如 `<DSMLtool_calls>`）和 canonical `<tool_calls>` wrapper 都会进入结构化捕获
+- DSML `<|DSML|tool_calls>` wrapper、基于固定本地标签名的 DSML 噪声容错形态、尾部管道符形态（如 `<|DSML|tool_calls|`）和 canonical `<tool_calls>` wrapper 都会进入结构化捕获
 - 如果流里直接从 invoke 开始，但后面补上了 closing wrapper，Go 流式筛分也会按缺失 opening wrapper 的修复路径尝试恢复
 - 已识别成功的工具调用不会再次回流到普通文本
 - 不符合新格式的块不会执行，并继续按原样文本透传
@@ -61,6 +62,7 @@
 - 支持嵌套围栏（如 4 反引号嵌套 3 反引号）和 CDATA 内围栏保护
 - 如果模型把 `<![CDATA[` 打开后却没有闭合，流式扫描阶段仍会保守地继续缓冲，不会误把 CDATA 里的示例 XML 当成真实工具调用；在最终 parse / flush 恢复阶段，会对这类 loose CDATA 做窄修复，尽量保住外层已完整包裹的真实工具调用
 - 当文本中 mention 了某种标签名（如 `<dsml|tool_calls>` 或 Markdown inline code 里的 `<|DSML|tool_calls>`）而后面紧跟真正工具调用时，sieve 会跳过不可解析的 mention 候选并继续匹配后续真实工具块，不会因 mention 导致工具调用丢失，也不会截断 mention 后的正文
+- Go 侧 SSE 读取不再使用 `bufio.Scanner` 的固定 token 上限；单个 `data:` 行中包含很长的写文件参数时，非流式收集、流式解析与 auto-continue 透传都应保留完整行，再交给 tool parser 处理

 另外，`<parameter>` 的值如果本身是合法 JSON 字面量，也会按结构化值解析，而不是一律保留为字符串。例如 `123`、`true`、`null`、`[1,2]`、`{"a":1}` 都会还原成对应的 number / boolean / null / array / object。
 结构化 XML 参数也会还原为 JSON 结构：如果参数体只包含一个或多个 `<item>...</item>` 子节点，会输出数组；嵌套对象里的 item-only 字段也同样按数组处理。例如 `<parameter name="questions"><item><question>...</question></item></parameter>` 会输出 `{"questions":[{"question":"..."}]}`，而不是 `{"questions":{"item":...}}`。
@@ -94,7 +96,7 @@ node --test tests/node/stream-tool-sieve.test.js

 - DSML `<|DSML|tool_calls>` wrapper 正常解析
 - legacy canonical `<tool_calls>` wrapper 正常解析
-- 别名变体（`<dsml|tool_calls>`、`<｜tool_calls>`、`<|tool_calls>`）、DSML 空格分隔 typo（如 `<|DSML tool_calls>`）和黏连 typo（如 `<DSMLtool_calls>`）正常解析
+- 固定本地标签名的 DSML 噪声容错形态（如 `<DSML|tool_calls>`、`<<|DSML|tool_calls>`、`<|DSML tool_calls>`、`<DSMLtool_calls>`、`<<DSML|DSML|tool_calls>`）正常解析
 - 混搭标签（DSML wrapper + canonical inner）归一化后正常解析
 - 波浪线围栏 `~~~` 内的示例不执行
 - 嵌套围栏（4 反引号嵌套 3 反引号）内的示例不执行
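
The documentation hunk above states that a `<parameter>` value that is a valid JSON literal is restored to a structured value instead of staying a string. A minimal sketch of that rule follows. `parseParamValue` is a hypothetical name, not a function from this repository, and the real parser may differ (for instance in how it treats whitespace or quoted JSON strings).

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// parseParamValue is a hypothetical helper (not the project's real parser).
// It returns the decoded JSON value when raw is a valid JSON literal and
// falls back to the original string otherwise.
func parseParamValue(raw string) any {
	trimmed := strings.TrimSpace(raw)
	if trimmed == "" {
		return raw
	}
	var v any
	if err := json.Unmarshal([]byte(trimmed), &v); err != nil {
		return raw // not JSON: keep the text as-is
	}
	return v
}

func main() {
	// The sample values mirror the ones listed in the documentation.
	for _, raw := range []string{"123", "true", "null", "[1,2]", `{"a":1}`, "hello"} {
		fmt.Printf("%-8s -> %T\n", raw, parseParamValue(raw))
	}
}
```

Note that `encoding/json` decodes every JSON number into `float64` when the target is `any`; a parser that must keep large integers exact would need `json.Decoder.UseNumber` instead.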
--- a/go.mod
+++ b/go.mod
@@ -6,10 +6,13 @@ require (
 	github.com/andybalholm/brotli v1.2.1
 	github.com/go-chi/chi/v5 v5.2.5
 	github.com/google/uuid v1.6.0
+	github.com/hupe1980/go-tiktoken v0.0.10
 	github.com/refraction-networking/utls v1.8.2
 	github.com/router-for-me/CLIProxyAPI/v6 v6.9.14
 )

+require github.com/dlclark/regexp2 v1.11.5 // indirect
+
 require (
 	github.com/klauspost/compress v1.18.5 // indirect
 	github.com/sirupsen/logrus v1.9.4 // indirect
--- a/go.sum
+++ b/go.sum
@@ -2,10 +2,14 @@ github.com/andybalholm/brotli v1.2.1 h1:R+f5xP285VArJDRgowrfb9DqL18yVK0gKAW/F+eT
 github.com/andybalholm/brotli v1.2.1/go.mod h1:rzTDkvFWvIrjDXZHkuS16NPggd91W3kUSvPlQ1pLaKY=
 github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
 github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
+github.com/dlclark/regexp2 v1.11.5 h1:Q/sSnsKerHeCkc/jSTNq1oCm7KiVgUMZRDUoRu0JQZQ=
+github.com/dlclark/regexp2 v1.11.5/go.mod h1:DHkYz0B9wPfa6wondMfaivmHpzrQ3v9q8cnmRbL6yW8=
 github.com/go-chi/chi/v5 v5.2.5 h1:Eg4myHZBjyvJmAFjFvWgrqDTXFyOzjj7YIm3L3mu6Ug=
 github.com/go-chi/chi/v5 v5.2.5/go.mod h1:X7Gx4mteadT3eDOMTsXzmI4/rwUpOwBHLpAfupzFJP0=
 github.com/google/uuid v1.6.0 h1:NIvaJDMOsjHA8n1jAhLSgzrAzy1Hgr+hNrb57e+94F0=
 github.com/google/uuid v1.6.0/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo=
+github.com/hupe1980/go-tiktoken v0.0.10 h1:m6phOJaGyctqWdGIgwn9X8AfJvaG74tnQoDL+ntOUEQ=
+github.com/hupe1980/go-tiktoken v0.0.10/go.mod h1:NME6d8hrE+Jo+kLUZHhXShYV8e40hYkm4BbSLQKtvAo=
 github.com/klauspost/compress v1.18.5 h1:/h1gH5Ce+VWNLSWqPzOVn6XBO+vJbCNGvjoaGBFW2IE=
 github.com/klauspost/compress v1.18.5/go.mod h1:cwPg85FWrGar70rWktvGQj8/hthj3wpl0PGDogxkrSQ=
 github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
@@ -37,6 +41,8 @@ golang.org/x/net v0.52.0 h1:He/TN1l0e4mmR3QqHMT2Xab3Aj3L9qjbhRm78/6jrW0=
 golang.org/x/net v0.52.0/go.mod h1:R1MAz7uMZxVMualyPXb+VaqGSa3LIaUqk0eEt3w36Sw=
 golang.org/x/sys v0.42.0 h1:omrd2nAlyT5ESRdCLYdm3+fMfNFE/+Rf4bDIQImRJeo=
 golang.org/x/sys v0.42.0/go.mod h1:4GL1E5IUh+htKOUEOaiffhrAeqysfVGipDYzABqnCmw=
+golang.org/x/text v0.35.0 h1:JOVx6vVDFokkpaq1AEptVzLTpDe9KGpj5tR4/X+ybL8=
+golang.org/x/text v0.35.0/go.mod h1:khi/HExzZJ2pGnjenulevKNX1W67CUy0AsXcNubPGCA=
 gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405 h1:yhCVgyC4o1eVCa2tZl7eS0r+SDo693bJlVdllGtEeKM=
 gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
 gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
--- a/internal/chathistory/store.go
+++ b/internal/chathistory/store.go
@@ -14,6 +14,7 @@ import (
 	"github.com/google/uuid"

 	"ds2api/internal/config"
+	"ds2api/internal/util"
 )

 const (
@@ -309,8 +310,12 @@ func (s *Store) Update(id string, params UpdateParams) (Entry, error) {
 	if params.Status != "" {
 		item.Status = params.Status
 	}
-	item.ReasoningContent = params.ReasoningContent
-	item.Content = params.Content
+	if params.ReasoningContent != "" || item.ReasoningContent == "" {
+		item.ReasoningContent = params.ReasoningContent
+	}
+	if params.Content != "" || item.Content == "" {
+		item.Content = params.Content
+	}
 	item.Error = strings.TrimSpace(params.Error)
 	item.StatusCode = params.StatusCode
 	item.ElapsedMs = params.ElapsedMs
@@ -610,8 +615,8 @@ func buildPreview(item Entry) string {
 	if candidate == "" {
 		candidate = strings.TrimSpace(item.UserInput)
 	}
-	if len(candidate) > defaultPreviewAt {
-		return candidate[:defaultPreviewAt] + "..."
+	if truncated, ok := util.TruncateRunes(candidate, defaultPreviewAt); ok {
+		return truncated + "..."
 	}
 	return candidate
 }
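
The `buildPreview` change above replaces a byte slice (`candidate[:defaultPreviewAt]`) with `util.TruncateRunes`, whose body is not part of this diff. The sketch below shows one way such a helper could work, assuming the limit counts runes and the boolean reports whether anything was cut. `truncateRunes` is a stand-in written for illustration only.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateRunes is a hypothetical stand-in for util.TruncateRunes.
// It keeps at most limit runes and reports whether the input was shortened.
func truncateRunes(s string, limit int) (string, bool) {
	if limit < 0 {
		limit = 0
	}
	count := 0
	// Ranging over a string yields the byte offset of each rune start,
	// so s[:i] always ends on a rune boundary.
	for i := range s {
		if count == limit {
			return s[:i], true
		}
		count++
	}
	return s, false
}

func main() {
	s := "😀😀😀"
	byteCut := s[:2] // cuts through the middle of a 4-byte rune
	runeCut, _ := truncateRunes(s, 2)
	fmt.Println(utf8.ValidString(byteCut), utf8.ValidString(runeCut)) // prints "false true"
}
```

This is the failure mode that `TestBuildPreviewPreservesUTF8MB4Characters` guards against: slicing by bytes can leave a partial multi-byte character at the end of the preview.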
--- a/internal/chathistory/store_test.go
+++ b/internal/chathistory/store_test.go
@@ -8,6 +8,7 @@ import (
 	"strings"
 	"sync"
 	"testing"
+	"unicode/utf8"
 )

 func blockDetailDir(t *testing.T, detailDir string) func() {
@@ -105,6 +106,17 @@ func TestStoreCreatesAndPersistsEntries(t *testing.T) {
 	}
 }

+func TestBuildPreviewPreservesUTF8MB4Characters(t *testing.T) {
+	long := strings.Repeat("😀", defaultPreviewAt+1)
+	preview := buildPreview(Entry{Content: long})
+	if !utf8.ValidString(preview) {
+		t.Fatalf("expected valid utf-8 preview, got %q", preview)
+	}
+	if preview != strings.Repeat("😀", defaultPreviewAt)+"..." {
+		t.Fatalf("unexpected preview: %q", preview)
+	}
+}
+
 func TestStoreTrimsToConfiguredLimit(t *testing.T) {
 	path := filepath.Join(t.TempDir(), "chat_history.json")
 	store := New(path)
@@ -481,3 +493,112 @@ func TestStoreWritesOnlyChangedDetailFiles(t *testing.T) {
 		t.Fatalf("expected untouched detail file to remain byte-identical")
 	}
 }
+
+func TestUpdatePreservesContentWhenNewContentIsEmpty(t *testing.T) {
+	path := filepath.Join(t.TempDir(), "chat_history.json")
+	store := New(path)
+
+	started, err := store.Start(StartParams{
+		CallerID:  "caller:abc",
+		Model:     "deepseek-v4-flash",
+		Stream:    true,
+		UserInput: "hello",
+	})
+	if err != nil {
+		t.Fatalf("start entry failed: %v", err)
+	}
+
+	if _, err := store.Update(started.ID, UpdateParams{
+		Status:           "streaming",
+		ReasoningContent: "let me think",
+		Content:          "I'll help you with that.",
+	}); err != nil {
+		t.Fatalf("progress update failed: %v", err)
+	}
+
+	updated, err := store.Update(started.ID, UpdateParams{
+		Status:    "success",
+		Content:   "",
+		Completed: true,
+	})
+	if err != nil {
+		t.Fatalf("success update failed: %v", err)
+	}
+
+	if updated.Content != "I'll help you with that." {
+		t.Fatalf("expected content to be preserved, got %q", updated.Content)
+	}
+	if updated.ReasoningContent != "let me think" {
+		t.Fatalf("expected reasoning content to be preserved, got %q", updated.ReasoningContent)
+	}
+
+	full, err := store.Get(started.ID)
+	if err != nil {
+		t.Fatalf("get entry failed: %v", err)
+	}
+	if full.Content != "I'll help you with that." {
+		t.Fatalf("expected persisted content to be preserved, got %q", full.Content)
+	}
+	if full.ReasoningContent != "let me think" {
+		t.Fatalf("expected persisted reasoning content to be preserved, got %q", full.ReasoningContent)
+	}
+}
+
+func TestUpdateAllowsSettingContentFromEmpty(t *testing.T) {
+	path := filepath.Join(t.TempDir(), "chat_history.json")
+	store := New(path)
+
+	started, err := store.Start(StartParams{
+		CallerID:  "caller:abc",
+		Model:     "deepseek-v4-flash",
+		Stream:    true,
+		UserInput: "hello",
+	})
+	if err != nil {
+		t.Fatalf("start entry failed: %v", err)
+	}
+
+	updated, err := store.Update(started.ID, UpdateParams{
+		Status:  "success",
+		Content: "final answer",
+	})
+	if err != nil {
+		t.Fatalf("update failed: %v", err)
+	}
+	if updated.Content != "final answer" {
+		t.Fatalf("expected content to be set, got %q", updated.Content)
+	}
+}
+
+func TestUpdateAllowsOverwritingContentWithNewValue(t *testing.T) {
+	path := filepath.Join(t.TempDir(), "chat_history.json")
+	store := New(path)
+
+	started, err := store.Start(StartParams{
+		CallerID:  "caller:abc",
+		Model:     "deepseek-v4-flash",
+		Stream:    true,
+		UserInput: "hello",
+	})
+	if err != nil {
+		t.Fatalf("start entry failed: %v", err)
+	}
+
+	if _, err := store.Update(started.ID, UpdateParams{
+		Status:  "streaming",
+		Content: "partial",
+	}); err != nil {
+		t.Fatalf("first update failed: %v", err)
+	}
+
+	updated, err := store.Update(started.ID, UpdateParams{
+		Status:  "success",
+		Content: "final answer",
+	})
+	if err != nil {
+		t.Fatalf("second update failed: %v", err)
+	}
+	if updated.Content != "final answer" {
+		t.Fatalf("expected content to be overwritten, got %q", updated.Content)
+	}
+}
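
The three tests above pin down one rule in `Store.Update`: an empty incoming value does not erase content that was already stored, while a non-empty value always wins. The rule can be isolated as a small function. `keepUnlessReplaced` is a hypothetical name used only for this sketch; the condition itself is copied from the `store.go` hunk.

```go
package main

import "fmt"

// keepUnlessReplaced mirrors the condition used in the store.go hunk:
// take the incoming value when it is non-empty or when nothing is stored yet,
// otherwise keep what is already there.
func keepUnlessReplaced(current, incoming string) string {
	if incoming != "" || current == "" {
		return incoming
	}
	return current
}

func main() {
	fmt.Printf("%q\n", keepUnlessReplaced("partial", ""))             // empty update keeps stored text
	fmt.Printf("%q\n", keepUnlessReplaced("", "final answer"))        // first write sets the value
	fmt.Printf("%q\n", keepUnlessReplaced("partial", "final answer")) // non-empty update overwrites
}
```

One consequence of this design is that a caller can no longer clear stored content by passing an empty string through this path.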
--- a/internal/config/config_edge_test.go
+++ b/internal/config/config_edge_test.go
@@ -79,13 +79,20 @@ func TestGetModelConfigDeepSeekExpertReasonerSearch(t *testing.T) {
 	}
 }

-func TestGetModelConfigDeepSeekVisionReasonerSearch(t *testing.T) {
-	thinking, search, ok := GetModelConfig("deepseek-v4-vision-search")
+func TestGetModelConfigDeepSeekVision(t *testing.T) {
+	thinking, search, ok := GetModelConfig("deepseek-v4-vision")
 	if !ok {
-		t.Fatal("expected ok for deepseek-v4-vision-search")
+		t.Fatal("expected ok for deepseek-v4-vision")
 	}
-	if !thinking || !search {
-		t.Fatalf("expected both true, got thinking=%v search=%v", thinking, search)
+	if !thinking || search {
+		t.Fatalf("expected thinking=true search=false, got thinking=%v search=%v", thinking, search)
+	}
+}
+
+func TestGetModelConfigDeepSeekVisionSearchUnsupported(t *testing.T) {
+	_, _, ok := GetModelConfig("deepseek-v4-vision-search")
+	if ok {
+		t.Fatal("expected deepseek-v4-vision-search to be unsupported")
 	}
 }

@@ -748,18 +755,16 @@ func TestOpenAIModelsResponse(t *testing.T) {
 		t.Fatal("expected non-empty models list")
 	}
 	expected := map[string]bool{
-		"deepseek-v4-flash":                    false,
-		"deepseek-v4-flash-nothinking":         false,
-		"deepseek-v4-pro":                      false,
-		"deepseek-v4-pro-nothinking":           false,
-		"deepseek-v4-flash-search":             false,
-		"deepseek-v4-flash-search-nothinking":  false,
-		"deepseek-v4-pro-search":               false,
-		"deepseek-v4-pro-search-nothinking":    false,
-		"deepseek-v4-vision":                   false,
-		"deepseek-v4-vision-nothinking":        false,
-		"deepseek-v4-vision-search":            false,
-		"deepseek-v4-vision-search-nothinking": false,
+		"deepseek-v4-flash":                   false,
+		"deepseek-v4-flash-nothinking":        false,
+		"deepseek-v4-pro":                     false,
+		"deepseek-v4-pro-nothinking":          false,
+		"deepseek-v4-flash-search":            false,
+		"deepseek-v4-flash-search-nothinking": false,
+		"deepseek-v4-pro-search":              false,
+		"deepseek-v4-pro-search-nothinking":   false,
+		"deepseek-v4-vision":                  false,
+		"deepseek-v4-vision-nothinking":       false,
 	}
 	for _, model := range data {
 		if _, ok := expected[model.ID]; ok {
--- a/internal/config/model_alias_test.go
+++ b/internal/config/model_alias_test.go
@@ -144,10 +144,17 @@ func TestResolveModelCustomAliasToExpert(t *testing.T) {

 func TestResolveModelCustomAliasToVision(t *testing.T) {
 	got, ok := ResolveModel(mockModelAliasReader{
-		"my-vision-model": "deepseek-v4-vision-search",
+		"my-vision-model": "deepseek-v4-vision",
 	}, "my-vision-model")
-	if !ok || got != "deepseek-v4-vision-search" {
-		t.Fatalf("expected alias -> deepseek-v4-vision-search, got ok=%v model=%q", ok, got)
+	if !ok || got != "deepseek-v4-vision" {
+		t.Fatalf("expected alias -> deepseek-v4-vision, got ok=%v model=%q", ok, got)
+	}
+}
+
+func TestResolveModelHeuristicVisionIgnoresSearchSuffix(t *testing.T) {
+	got, ok := ResolveModel(nil, "gemini-vision-search")
+	if !ok || got != "deepseek-v4-vision" {
+		t.Fatalf("expected heuristic vision alias to resolve without search variant, got ok=%v model=%q", ok, got)
 	}
 }

--- a/internal/config/models.go
+++ b/internal/config/models.go
@@ -22,7 +22,6 @@ var deepSeekBaseModels = []ModelInfo{
 	{ID: "deepseek-v4-flash-search", Object: "model", Created: 1677610602, OwnedBy: "deepseek", Permission: []any{}},
 	{ID: "deepseek-v4-pro-search", Object: "model", Created: 1677610602, OwnedBy: "deepseek", Permission: []any{}},
 	{ID: "deepseek-v4-vision", Object: "model", Created: 1677610602, OwnedBy: "deepseek", Permission: []any{}},
-	{ID: "deepseek-v4-vision-search", Object: "model", Created: 1677610602, OwnedBy: "deepseek", Permission: []any{}},
 }

 var DeepSeekModels = appendNoThinkingVariants(deepSeekBaseModels)
@@ -67,7 +66,7 @@ func GetModelConfig(model string) (thinking bool, search bool, ok bool) {
 	switch baseModel {
 	case "deepseek-v4-flash", "deepseek-v4-pro", "deepseek-v4-vision":
 		return !noThinking, false, true
-	case "deepseek-v4-flash-search", "deepseek-v4-pro-search", "deepseek-v4-vision-search":
+	case "deepseek-v4-flash-search", "deepseek-v4-pro-search":
 		return !noThinking, true, true
 	default:
 		return false, false, false
@@ -81,7 +80,7 @@ func GetModelType(model string) (modelType string, ok bool) {
 		return "default", true
 	case "deepseek-v4-pro", "deepseek-v4-pro-search":
 		return "expert", true
-	case "deepseek-v4-vision", "deepseek-v4-vision-search":
+	case "deepseek-v4-vision":
 		return "vision", true
 	default:
 		return "", false
@@ -359,8 +358,6 @@ func resolveCanonicalModel(aliases map[string]string, model string) (string, boo
 	useSearch := strings.Contains(model, "search")

 	switch {
-	case useVision && useSearch:
-		return "deepseek-v4-vision-search", true
 	case useVision:
 		return "deepseek-v4-vision", true
 	case useReasoner && useSearch:
--- a/internal/config/paths.go
+++ b/internal/config/paths.go
@@ -30,9 +30,29 @@ func ResolvePath(envKey, defaultRel string) string {
 }

 func ConfigPath() string {
+	if strings.TrimSpace(os.Getenv("DS2API_CONFIG_PATH")) == "" && BaseDir() == "/app" {
+		return containerDefaultConfigPath()
+	}
 	return ResolvePath("DS2API_CONFIG_PATH", "config.json")
 }

+func containerDefaultConfigPath() string {
+	// Container images run as non-root by default. Only use /data when mounted/provisioned.
+	// Otherwise keep /app/config.json so admin-side save does not fail on MkdirAll("/data").
+	if st, err := os.Stat("/data"); err == nil && st.IsDir() {
+		return "/data/config.json"
+	}
+	return "/app/config.json"
+}
+
+func legacyContainerConfigPath() string {
+	return "/app/config.json"
+}
+
+func shouldTryLegacyContainerConfigPath() bool {
+	return strings.TrimSpace(os.Getenv("DS2API_CONFIG_PATH")) == "" && BaseDir() == "/app"
+}
+
 func RawStreamSampleRoot() string {
 	return ResolvePath("DS2API_RAW_STREAM_SAMPLE_ROOT", "tests/raw_stream_samples")
 }
--- a/internal/config/paths_test.go
+++ b/internal/config/paths_test.go
@@ -0,0 +1,28 @@
+package config
+
+import (
+	"os"
+	"testing"
+)
+
+func TestContainerDefaultConfigPath(t *testing.T) {
+	t.Run("fallback to /app when /data is missing", func(t *testing.T) {
+		// This test environment does not guarantee a writable/mounted /data.
+		// If /data is absent we must keep /app fallback to avoid persistence failures.
+		if _, err := os.Stat("/data"); err == nil {
+			t.Skip("/data exists in this environment; cannot validate missing-/data fallback")
+		}
+		if got := containerDefaultConfigPath(); got != "/app/config.json" {
+			t.Fatalf("containerDefaultConfigPath() = %q, want %q", got, "/app/config.json")
+		}
+	})
+
+	t.Run("prefer /data when /data directory exists", func(t *testing.T) {
+		if _, err := os.Stat("/data"); err != nil {
+			t.Skip("/data does not exist in this environment")
+		}
+		if got := containerDefaultConfigPath(); got != "/data/config.json" {
+			t.Fatalf("containerDefaultConfigPath() = %q, want %q", got, "/data/config.json")
+		}
+	})
+}
--- a/internal/config/store.go
+++ b/internal/config/store.go
@@ -87,12 +87,17 @@ func loadConfig() (Config, bool, error) {
 		}
 		return cfg, true, err
 	}
-
 	cfg, err := loadConfigFromFile(ConfigPath())
 	if err != nil {
+		if shouldTryLegacyContainerConfigPath() {
+			legacyPath := legacyContainerConfigPath()
+			if legacyCfg, legacyErr := loadConfigFromFile(legacyPath); legacyErr == nil {
+				Logger.Info("[config] loaded legacy container config path", "path", legacyPath)
+				return legacyCfg, false, nil
+			}
+		}
 		if IsVercel() {
-			// Vercel one-click deploy may start without a writable/present config file.
-			// Keep an in-memory config so users can bootstrap via WebUI then sync env.
+			// Vercel may start without writable/present config; keep in-memory bootstrap config.
 			return Config{}, true, nil
 		}
 		return Config{}, false, err
--- a/internal/deepseek/client/client_continue.go
+++ b/internal/deepseek/client/client_continue.go
@@ -7,6 +7,7 @@ import (
 	dsprotocol "ds2api/internal/deepseek/protocol"
 	"encoding/json"
 	"errors"
+	"fmt"
 	"io"
 	"net/http"
 	"strings"
@@ -27,7 +28,7 @@ type continueState struct {
 }

 // wrapCompletionWithAutoContinue wraps the completion response body so that
-// if the upstream indicates the response is incomplete (WIP / INCOMPLETE /
+// if the upstream indicates the response is incomplete (INCOMPLETE /
 // AUTO_CONTINUE), ds2api will automatically call the DeepSeek continue
 // endpoint and splice the continuation SSE stream onto the original.
 // The caller sees a single, seamless SSE stream.
@@ -132,33 +133,51 @@ func pumpAutoContinue(ctx context.Context, pw *io.PipeWriter, initial io.ReadClo
 // sentinels are consumed (not forwarded) so that the downstream only sees
 // one final [DONE] at the very end.
 func streamBodyWithContinueState(ctx context.Context, pw *io.PipeWriter, body io.Reader, state *continueState) (bool, error) {
-	scanner := bufio.NewScanner(body)
-	scanner.Buffer(make([]byte, 0, 64*1024), 2*1024*1024)
+	reader := bufio.NewReaderSize(body, 64*1024)
 	hadDone := false
-	for scanner.Scan() {
+	for {
 		select {
 		case <-ctx.Done():
 			return hadDone, ctx.Err()
 		default:
 		}
-		line := append([]byte{}, scanner.Bytes()...)
-		trimmed := strings.TrimSpace(string(line))
-		if trimmed == "" {
-			continue
-		}
-		if strings.HasPrefix(trimmed, "data:") {
-			data := strings.TrimSpace(strings.TrimPrefix(trimmed, "data:"))
-			if data == "[DONE]" {
-				hadDone = true
-				continue
+		line, err := reader.ReadBytes('\n')
+		if len(line) == 0 && err != nil {
+			if err == io.EOF {
+				return hadDone, nil
 			}
-			state.observe(data)
+			return hadDone, err
 		}
-		if _, err := io.Copy(pw, bytes.NewReader(append(line, '\n'))); err != nil {
+		trimmed := strings.TrimSpace(string(line))
+		if trimmed != "" {
+			if strings.HasPrefix(trimmed, "data:") {
+				data := strings.TrimSpace(strings.TrimPrefix(trimmed, "data:"))
+				if data == "[DONE]" {
+					hadDone = true
+					if err != nil && err != io.EOF {
+						return hadDone, err
+					}
+					if err == io.EOF {
+						return hadDone, nil
+					}
+					continue
+				}
+				state.observe(data)
+			}
+			if !strings.HasSuffix(string(line), "\n") {
+				line = append(line, '\n')
+			}
+			if _, copyErr := io.Copy(pw, bytes.NewReader(line)); copyErr != nil {
+				return hadDone, copyErr
+			}
+		}
+		if err != nil {
+			if err == io.EOF {
+				return hadDone, nil
+			}
 			return hadDone, err
 		}
 	}
-	return hadDone, scanner.Err()
 }

 // observe extracts continue-relevant signals from an SSE JSON chunk.
@@ -174,49 +193,100 @@ func (s *continueState) observe(data string) {
 	if id := intFrom(chunk["response_message_id"]); id > 0 {
 		s.responseMessageID = id
 	}
-	// Path-based status: {"p": "response/status", "v": "FINISHED"}
-	if p, _ := chunk["p"].(string); p == "response/status" {
-		if status, _ := chunk["v"].(string); status != "" {
-			s.lastStatus = strings.TrimSpace(status)
-			if strings.EqualFold(s.lastStatus, "FINISHED") {
-				s.finished = true
-			}
-		}
+	s.observeDirectPatch(asString(chunk["p"]), chunk["v"])
+	if p, _ := chunk["p"].(string); p == "response" {
+		s.observeBatchPatches("response", chunk["v"])
+	} else {
+		s.observeBatchPatches("", chunk["v"])
 	}
-	// Nested v.response
-	v, _ := chunk["v"].(map[string]any)
-	if response, _ := v["response"].(map[string]any); response != nil {
-		if id := intFrom(response["message_id"]); id > 0 {
-			s.responseMessageID = id
-		}
-		if status, _ := response["status"].(string); status != "" {
-			s.lastStatus = strings.TrimSpace(status)
-			if strings.EqualFold(s.lastStatus, "FINISHED") {
-				s.finished = true
-			}
-		}
-		if autoContinue, ok := response["auto_continue"].(bool); ok && autoContinue {
+	if v, _ := chunk["v"].(map[string]any); v != nil {
+		s.observeResponseObject(v["response"])
+	}
+	if message, _ := chunk["message"].(map[string]any); message != nil {
+		s.observeResponseObject(message["response"])
+	}
+}
+
+func (s *continueState) observeDirectPatch(path string, value any) {
+	if s == nil {
+		return
+	}
+	switch strings.Trim(strings.TrimSpace(path), "/") {
+	case "response/status", "status", "response/quasi_status", "quasi_status":
+		s.setStatus(asString(value))
+	case "response/auto_continue", "auto_continue":
+		if v, ok := value.(bool); ok && v {
 			s.lastStatus = "AUTO_CONTINUE"
 		}
 	}
-	// Nested message.response
-	if message, _ := chunk["message"].(map[string]any); message != nil {
-		if response, _ := message["response"].(map[string]any); response != nil {
-			if id := intFrom(response["message_id"]); id > 0 {
-				s.responseMessageID = id
-			}
-			if status, _ := response["status"].(string); status != "" {
-				s.lastStatus = strings.TrimSpace(status)
-				if strings.EqualFold(s.lastStatus, "FINISHED") {
-					s.finished = true
-				}
+}
+
+func (s *continueState) observeResponseObject(raw any) {
+	if s == nil {
+		return
+	}
+	response, _ := raw.(map[string]any)
+	if response == nil {
+		return
+	}
+	if id := intFrom(response["message_id"]); id > 0 {
+		s.responseMessageID = id
+	}
+	s.setStatus(asString(response["status"]))
+	if autoContinue, ok := response["auto_continue"].(bool); ok && autoContinue {
+		s.lastStatus = "AUTO_CONTINUE"
+	}
+}
+
+func (s *continueState) observeBatchPatches(parentPath string, raw any) {
+	if s == nil {
+		return
+	}
+	patches, ok := raw.([]any)
+	if !ok {
+		return
+	}
+	for _, patch := range patches {
+		m, ok := patch.(map[string]any)
+		if !ok {
+			continue
+		}
+		path := strings.TrimSpace(asString(m["p"]))
+		if path == "" {
+			continue
+		}
+		fullPath := path
+		if parent := strings.Trim(strings.TrimSpace(parentPath), "/"); parent != "" && !strings.Contains(path, "/") {
+			fullPath = parent + "/" + path
+		}
+		switch strings.Trim(strings.TrimSpace(fullPath), "/") {
+		case "response/status", "status", "response/quasi_status", "quasi_status":
+			s.setStatus(asString(m["v"]))
+		case "response/auto_continue", "auto_continue":
+			if v, ok := m["v"].(bool); ok && v {
+				s.lastStatus = "AUTO_CONTINUE"
 			}
 		}
 	}
 }

-// shouldContinue returns true when the upstream indicates the response is
-// not yet finished and we have enough information to issue a continue request.
+func (s *continueState) setStatus(status string) {
+	if s == nil {
+		return
+	}
+	normalized := strings.TrimSpace(status)
+	if normalized == "" {
+		return
+	}
+	s.lastStatus = normalized
+	if strings.EqualFold(normalized, "FINISHED") || strings.EqualFold(normalized, "CONTENT_FILTER") {
+		s.finished = true
+	}
+}
+
+// shouldContinue returns true when the upstream explicitly indicates the
+// response is incomplete and we have enough information to issue a continue
+// request. Plain WIP is not sufficient because normal streams begin in WIP.
 func (s *continueState) shouldContinue() bool {
 	if s == nil {
 		return false
@@ -225,7 +295,7 @@ func (s *continueState) shouldContinue() bool {
 		return false
 	}
 	switch strings.ToUpper(strings.TrimSpace(s.lastStatus)) {
-	case "WIP", "INCOMPLETE", "AUTO_CONTINUE":
+	case "INCOMPLETE", "AUTO_CONTINUE":
 		return true
 	default:
 		return false
@@ -241,3 +311,19 @@ func (s *continueState) prepareForNextRound() {
 	s.finished = false
 	s.lastStatus = ""
 }
+
+func asString(v any) string {
+	if v == nil {
+		return ""
+	}
+	switch x := v.(type) {
+	case string:
+		return x
+	default:
+		s := strings.TrimSpace(strings.ReplaceAll(strings.TrimSpace(fmt.Sprint(v)), "\u0000", ""))
+		if s == "<nil>" {
+			return ""
+		}
+		return s
+	}
+}
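
`observeBatchPatches` above resolves each patch path against its parent before matching it, so a `quasi_status` entry inside a `{"p":"response","o":"BATCH",...}` chunk is treated like `response/quasi_status`. The sketch below reproduces only that path resolution on a BATCH-shaped chunk. The function names are hypothetical, and it deliberately ignores the other chunk shapes that `observe` handles.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// resolvePatchPath joins a relative patch path with its parent path,
// following the same rule as the hunk: only paths without "/" are prefixed.
func resolvePatchPath(parent, path string) string {
	path = strings.TrimSpace(path)
	full := path
	if p := strings.Trim(strings.TrimSpace(parent), "/"); p != "" && !strings.Contains(path, "/") {
		full = p + "/" + path
	}
	return strings.Trim(strings.TrimSpace(full), "/")
}

// isStatusPath lists the paths the hunk treats as status carriers.
func isStatusPath(full string) bool {
	switch full {
	case "response/status", "status", "response/quasi_status", "quasi_status":
		return true
	}
	return false
}

// statusFromBatch is a hypothetical helper: it returns the last status found
// in a BATCH chunk. Chunks whose "v" is not an array yield an error.
func statusFromBatch(data string) (string, error) {
	var chunk struct {
		P string `json:"p"`
		V []struct {
			P string `json:"p"`
			V any    `json:"v"`
		} `json:"v"`
	}
	if err := json.Unmarshal([]byte(data), &chunk); err != nil {
		return "", err
	}
	status := ""
	for _, patch := range chunk.V {
		if !isStatusPath(resolvePatchPath(chunk.P, patch.P)) {
			continue
		}
		if s, ok := patch.V.(string); ok && strings.TrimSpace(s) != "" {
			status = strings.TrimSpace(s)
		}
	}
	return status, nil
}

func main() {
	// Sample line taken from the batch quasi_status test in this change set.
	data := `{"p":"response","o":"BATCH","v":[{"p":"accumulated_token_usage","v":2413},{"p":"quasi_status","v":"INCOMPLETE"}]}`
	status, err := statusFromBatch(data)
	fmt.Println(status, err)
}
```

In the real code, a resolved status of `INCOMPLETE` is what lets `shouldContinue` return true; plain `WIP` no longer does.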
--- a/internal/deepseek/client/client_continue_test.go
+++ b/internal/deepseek/client/client_continue_test.go
@@ -8,6 +8,7 @@ import (
 	"io"
 	"net/http"
 	"strings"
+	"sync/atomic"
 	"testing"

 	"ds2api/internal/auth"
@@ -124,6 +125,146 @@ func TestCallCompletionAutoContinueThreadsPowHeader(t *testing.T) {
 	}
 }

+func TestAutoContinueDoesNotTriggerOnPlainWIPWithoutExplicitContinuationSignal(t *testing.T) {
+	initialBody := strings.Join([]string{
+		`data: {"response_message_id":321,"v":{"response":{"message_id":321,"status":"WIP","auto_continue":false}}}`,
+		`data: [DONE]`,
+	}, "\n") + "\n"
+
+	var continueCalls atomic.Int32
+	body := newAutoContinueBody(context.Background(), io.NopCloser(strings.NewReader(initialBody)), "session-123", 8, func(context.Context, string, int) (*http.Response, error) {
+		continueCalls.Add(1)
+		return nil, errors.New("continue should not have been called")
+	})
+	defer func() { _ = body.Close() }()
+
+	out, err := io.ReadAll(body)
+	if err != nil {
+		t.Fatalf("read body failed: %v", err)
+	}
+	if continueCalls.Load() != 0 {
+		t.Fatalf("expected no continue calls, got %d", continueCalls.Load())
+	}
+	if !bytes.Contains(out, []byte(`"status":"WIP"`)) || !bytes.Contains(out, []byte(`data: [DONE]`)) {
+		t.Fatalf("expected original body to pass through unchanged, got=%s", string(out))
+	}
+}
+
+func TestAutoContinuePassesThroughLongSingleSSELine(t *testing.T) {
+	payload := strings.Repeat("x", 2*1024*1024+4096)
+	initialBody := `data: {"p":"response/content","v":"` + payload + `"}` + "\n" +
+		`data: [DONE]` + "\n"
+
+	body := newAutoContinueBody(context.Background(), io.NopCloser(strings.NewReader(initialBody)), "session-123", 8, func(context.Context, string, int) (*http.Response, error) {
+		return nil, errors.New("continue should not have been called")
+	})
+	defer func() { _ = body.Close() }()
+
+	out, err := io.ReadAll(body)
+	if err != nil {
+		t.Fatalf("read body failed: %v", err)
+	}
+	if !bytes.Contains(out, []byte(payload)) {
+		t.Fatalf("expected long SSE payload to pass through, got len=%d want payload len=%d", len(out), len(payload))
+	}
+	if !bytes.Contains(out, []byte(`data: [DONE]`)) {
+		t.Fatalf("expected final DONE sentinel in body, got len=%d", len(out))
+	}
+}
+
+func TestAutoContinueTriggersOnDirectQuasiStatusIncomplete(t *testing.T) {
+	initialBody := strings.Join([]string{
+		`data: {"response_message_id":321,"p":"response/content","v":"<tool_calls><invoke name=\"write_file\"><parameter name=\"content\"><![CDATA[part-one"}`,
+		`data: {"p":"response/quasi_status","v":"INCOMPLETE"}`,
+		`data: [DONE]`,
+	}, "\n") + "\n"
+
+	var continueCalls atomic.Int32
+	body := newAutoContinueBody(context.Background(), io.NopCloser(strings.NewReader(initialBody)), "session-123", 8, func(context.Context, string, int) (*http.Response, error) {
+		continueCalls.Add(1)
+		return &http.Response{
+			StatusCode: http.StatusOK,
+			Header:     make(http.Header),
+			Body: io.NopCloser(strings.NewReader(
+				`data: {"response_message_id":322,"p":"response/content","v":"-part-two]]></parameter></invoke></tool_calls>"}` + "\n" +
+					`data: {"p":"response/status","v":"FINISHED"}` + "\n" +
+					`data: [DONE]` + "\n",
+			)),
+		}, nil
+	})
+	defer func() { _ = body.Close() }()
+
+	out, err := io.ReadAll(body)
+	if err != nil {
+		t.Fatalf("read body failed: %v", err)
+	}
+	if continueCalls.Load() != 1 {
+		t.Fatalf("expected exactly one continue call, got %d", continueCalls.Load())
+	}
+	if !bytes.Contains(out, []byte("part-one")) || !bytes.Contains(out, []byte("-part-two")) {
+		t.Fatalf("expected continued tool content in body, got=%s", string(out))
+	}
+}
+
+func TestAutoContinueTriggersOnResponseBatchQuasiStatusIncomplete(t *testing.T) {
+	initialBody := strings.Join([]string{
+		`data: {"response_message_id":321,"v":{"response":{"message_id":321,"status":"WIP","auto_continue":false}}}`,
+		`data: {"p":"response","o":"BATCH","v":[{"p":"accumulated_token_usage","v":2413},{"p":"quasi_status","v":"INCOMPLETE"}]}`,
+		`data: [DONE]`,
+	}, "\n") + "\n"
+
+	var continueCalls atomic.Int32
+	body := newAutoContinueBody(context.Background(), io.NopCloser(strings.NewReader(initialBody)), "session-123", 8, func(context.Context, string, int) (*http.Response, error) {
+		continueCalls.Add(1)
+		return &http.Response{
+			StatusCode: http.StatusOK,
+			Header:     make(http.Header),
+			Body: io.NopCloser(strings.NewReader(
+				`data: {"response_message_id":322,"p":"response/status","v":"FINISHED"}` + "\n" +
+					`data: [DONE]` + "\n",
+			)),
+		}, nil
+	})
+	defer func() { _ = body.Close() }()
+
+	out, err := io.ReadAll(body)
+	if err != nil {
+		t.Fatalf("read body failed: %v", err)
+	}
+	if continueCalls.Load() != 1 {
+		t.Fatalf("expected exactly one continue call, got %d", continueCalls.Load())
+	}
+	if !bytes.Contains(out, []byte(`"quasi_status","v":"INCOMPLETE"`)) || !bytes.Contains(out, []byte(`"v":"FINISHED"`)) {
+		t.Fatalf("expected continued output to include initial and final rounds, got=%s", string(out))
+	}
+}
+
+func TestAutoContinueDoesNotTriggerWhenResponseBatchQuasiStatusFinished(t *testing.T) {
+	initialBody := strings.Join([]string{
+		`data: {"response_message_id":321,"v":{"response":{"message_id":321,"status":"WIP","auto_continue":false}}}`,
+		`data: {"p":"response","o":"BATCH","v":[{"p":"accumulated_token_usage","v":2413},{"p":"quasi_status","v":"FINISHED"}]}`,
+		`data: [DONE]`,
+	}, "\n") + "\n"
+
+	var continueCalls atomic.Int32
+	body := newAutoContinueBody(context.Background(), io.NopCloser(strings.NewReader(initialBody)), "session-123", 8, func(context.Context, string, int) (*http.Response, error) {
+		continueCalls.Add(1)
+		return nil, errors.New("continue should not have been called")
+	})
+	defer func() { _ = body.Close() }()
+
+	out, err := io.ReadAll(body)
+	if err != nil {
+		t.Fatalf("read body failed: %v", err)
+	}
+	if continueCalls.Load() != 0 {
+		t.Fatalf("expected no continue calls, got %d", continueCalls.Load())
+	}
+	if !bytes.Contains(out, []byte(`"quasi_status","v":"FINISHED"`)) || !bytes.Contains(out, []byte(`data: [DONE]`)) {
+		t.Fatalf("expected original finished body to pass through unchanged, got=%s", string(out))
+	}
+}
+
 type failingOrCompletionDoer struct {
 	completionResp *http.Response
 }
@@ -134,3 +275,33 @@ func (d failingOrCompletionDoer) Do(req *http.Request) (*http.Response, error) {
 	}
 	return nil, errors.New("forced stream failure")
 }
+
+func TestAutoContinuePreservesIncompleteStateWhenNextChunkOmitsStatus(t *testing.T) {
+	initialBody := strings.Join([]string{
+		`data: {"response_message_id":321,"v":{"response":{"message_id":321,"status":"INCOMPLETE"}}}`,
+		`data: {"p":"response/content","v":{"text":"continued"}}`,
+		`data: [DONE]`,
+	}, "\n") + "\n"
+
+	var continueCalls atomic.Int32
+	body := newAutoContinueBody(context.Background(), io.NopCloser(strings.NewReader(initialBody)), "session-123", 8, func(context.Context, string, int) (*http.Response, error) {
+		continueCalls.Add(1)
+		return &http.Response{
+			StatusCode: http.StatusOK,
+			Header:     make(http.Header),
+			Body: io.NopCloser(strings.NewReader(
+				`data: {"response_message_id":322,"p":"response/status","v":"FINISHED"}` + "\n" +
+					`data: [DONE]` + "\n",
+			)),
+		}, nil
+	})
+	defer func() { _ = body.Close() }()
+
+	_, err := io.ReadAll(body)
+	if err != nil {
+		t.Fatalf("read body failed: %v", err)
+	}
+	if continueCalls.Load() != 1 {
+		t.Fatalf("expected exactly one continue call, got %d", continueCalls.Load())
+	}
+}
--- a/internal/deepseek/protocol/constants_shared.json
+++ b/internal/deepseek/protocol/constants_shared.json
@@ -2,7 +2,7 @@
  "client": {
    "name": "DeepSeek",
    "platform": "android",
-    "version": "2.0.1",
+    "version": "2.0.4",
    "android_api_level": "35",
    "locale": "zh_CN"
  },
@@ -24,4 +24,4 @@
  "skip_exact_paths": [
    "response/search_status"
  ]
-}
+}
--- a/internal/deepseek/protocol/sse.go
+++ b/internal/deepseek/protocol/sse.go
@@ -2,20 +2,24 @@ package protocol

 import (
 	"bufio"
+	"io"
 	"net/http"
 )

 func ScanSSELines(resp *http.Response, onLine func([]byte) bool) error {
-	scanner := bufio.NewScanner(resp.Body)
-	buf := make([]byte, 0, 64*1024)
-	scanner.Buffer(buf, 2*1024*1024)
-	for scanner.Scan() {
-		if !onLine(scanner.Bytes()) {
-			break
+	reader := bufio.NewReaderSize(resp.Body, 64*1024)
+	for {
+		line, err := reader.ReadBytes('\n')
+		if len(line) > 0 {
+			if !onLine(line) {
+				return nil
+			}
+		}
+		if err != nil {
+			if err == io.EOF {
+				return nil
+			}
+			return err
 		}
 	}
-	if err := scanner.Err(); err != nil {
-		return err
-	}
-	return nil
 }
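Note on the sse.go change above: `bufio.Scanner` fails with `bufio.ErrTooLong` once a single line exceeds its maximum buffer (2 MiB in the removed code), while `bufio.Reader.ReadBytes` grows its result as needed. A minimal sketch of the same loop follows. `scanLines` and `longestLine` are hypothetical names, and the sketch takes an `io.Reader` rather than an `*http.Response`. As in the new code, each line reaches the callback with its trailing newline still attached.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// scanLines is a hypothetical stand-in for ScanSSELines. It calls onLine for
// every line (trailing "\n" included when present) and stops early when
// onLine returns false. A final line without a newline is still delivered.
func scanLines(r io.Reader, onLine func([]byte) bool) error {
	reader := bufio.NewReaderSize(r, 64*1024)
	for {
		// ReadBytes returns whatever was read before the error, so the
		// data is handled first and the error second.
		line, err := reader.ReadBytes('\n')
		if len(line) > 0 && !onLine(line) {
			return nil
		}
		if err != nil {
			if err == io.EOF {
				return nil
			}
			return err
		}
	}
}

// longestLine reports the length of the longest line, line ending excluded.
func longestLine(input string) (int, error) {
	longest := 0
	err := scanLines(strings.NewReader(input), func(line []byte) bool {
		n := len(strings.TrimRight(string(line), "\r\n"))
		if n > longest {
			longest = n
		}
		return true
	})
	return longest, err
}

func main() {
	// One line larger than the old 2 MiB scanner limit.
	payload := strings.Repeat("x", 2*1024*1024+4096)
	n, err := longestLine("data: " + payload + "\n" + "data: [DONE]\n")
	fmt.Println(n, err)
}
```

Callers that previously relied on `scanner.Bytes()` returning the line without its terminator would need to trim it themselves, as `longestLine` does here.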
--- a/internal/deepseek/protocol/sse_test.go
+++ b/internal/deepseek/protocol/sse_test.go
@@ -0,0 +1,26 @@
+package protocol
+
+import (
+	"io"
+	"net/http"
+	"strings"
+	"testing"
+)
+
+func TestScanSSELinesHandlesLongSingleLine(t *testing.T) {
+	payload := strings.Repeat("x", 2*1024*1024+4096)
+	body := "data: {\"p\":\"response/content\",\"v\":\"" + payload + "\"}\n"
+	resp := &http.Response{Body: io.NopCloser(strings.NewReader(body))}
+
+	var got string
+	err := ScanSSELines(resp, func(line []byte) bool {
+		got = string(line)
+		return true
+	})
+	if err != nil {
+		t.Fatalf("ScanSSELines returned error: %v", err)
+	}
+	if !strings.Contains(got, payload) {
+		t.Fatalf("long SSE line was not preserved: got len=%d want payload len=%d", len(got), len(payload))
+	}
+}
--- a/internal/devcapture/store.go
+++ b/internal/devcapture/store.go
@@ -10,6 +10,8 @@ import (
 	"sync"
 	"time"

+	"ds2api/internal/util"
+
 	"github.com/google/uuid"
 )

@@ -194,7 +196,8 @@ func (c *captureBody) append(chunk string) {
 	}
 	remain := maxLen - current
 	if len(chunk) > remain {
-		c.buf.WriteString(chunk[:remain])
+		truncated, _ := util.TruncateUTF8Bytes(chunk, remain)
+		c.buf.WriteString(truncated)
 		c.truncated = true
 		return
 	}
--- a/internal/devcapture/store_test.go
+++ b/internal/devcapture/store_test.go
@@ -4,6 +4,7 @@ import (
 	"io"
 	"strings"
 	"testing"
+	"unicode/utf8"
 )

 func TestNewFromEnvDefaults(t *testing.T) {
@@ -82,3 +83,28 @@ func TestWrapBodyTruncatesByLimit(t *testing.T) {
 		t.Fatalf("expected account id, got %q", items[0].AccountID)
 	}
 }
+
+func TestWrapBodyTruncatesUTF8WithoutBreakingRune(t *testing.T) {
+	s := &Store{enabled: true, limit: 5, maxBodyBytes: 5}
+	session := s.Start("test", "http://x", "acc1", map[string]any{"x": 1})
+	if session == nil {
+		t.Fatal("expected session")
+	}
+	rc := session.WrapBody(io.NopCloser(strings.NewReader("😀xy")), 200)
+	_, _ = io.ReadAll(rc)
+	_ = rc.Close()
+
+	items := s.Snapshot()
+	if len(items) != 1 {
+		t.Fatalf("expected 1 item, got %d", len(items))
+	}
+	if !utf8.ValidString(items[0].ResponseBody) {
+		t.Fatalf("expected valid utf-8 response body, got %q", items[0].ResponseBody)
+	}
+	if items[0].ResponseBody != "😀x" {
+		t.Fatalf("expected rune-safe truncation, got %q", items[0].ResponseBody)
+	}
+	if !items[0].ResponseTruncated {
+		t.Fatal("expected truncated flag true")
+	}
+}
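Note on the rune-safe truncation tested above: the body of `util.TruncateUTF8Bytes` is not part of this excerpt, so the following is only a sketch of one way to get the tested behavior. `truncateUTF8Bytes` is a hypothetical name, and the sketch assumes valid UTF-8 input. It backs the cut point up until it lands on the first byte of a rune.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateUTF8Bytes (hypothetical) returns at most limit bytes of s without
// splitting a multi-byte rune. The bool reports whether anything was cut.
func truncateUTF8Bytes(s string, limit int) (string, bool) {
	if limit < 0 {
		limit = 0
	}
	if len(s) <= limit {
		return s, false
	}
	cut := limit
	// s[cut] is the first byte that would be dropped. If it is a
	// continuation byte, the cut falls inside a rune, so move left.
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut], true
}

func main() {
	// The emoji is 4 bytes, so a 5-byte budget keeps the emoji plus "x".
	out, truncated := truncateUTF8Bytes("😀xy", 5)
	fmt.Printf("%q %v\n", out, truncated)
}
```

With a budget smaller than the first rune (for example 3 bytes for the same input), this sketch returns an empty string rather than a partial emoji.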
--- a/internal/format/claude/render.go
+++ b/internal/format/claude/render.go
@@ -5,6 +5,7 @@ import (
 	"fmt"
 	"time"

+	"ds2api/internal/prompt"
 	"ds2api/internal/util"
 )

@@ -43,8 +44,23 @@ func BuildMessageResponse(messageID, model string, normalizedMessages []any, fin
 		"stop_reason":   stopReason,
 		"stop_sequence": nil,
 		"usage": map[string]any{
-			"input_tokens":  util.EstimateTokens(fmt.Sprintf("%v", normalizedMessages)),
-			"output_tokens": util.EstimateTokens(finalThinking) + util.EstimateTokens(finalText),
+			"input_tokens":  util.CountPromptTokens(prompt.MessagesPrepareWithThinking(claudeMessageMaps(normalizedMessages), false), model),
+			"output_tokens": util.CountOutputTokens(finalThinking, model) + util.CountOutputTokens(finalText, model),
 		},
 	}
 }
+
+func claudeMessageMaps(messages []any) []map[string]any {
+	if len(messages) == 0 {
+		return nil
+	}
+	out := make([]map[string]any, 0, len(messages))
+	for _, item := range messages {
+		msg, ok := item.(map[string]any)
+		if !ok {
+			continue
+		}
+		out = append(out, msg)
+	}
+	return out
+}
--- a/internal/format/openai/render_chat.go
+++ b/internal/format/openai/render_chat.go
@@ -29,7 +29,7 @@ func BuildChatCompletionWithToolCalls(completionID, model, finalPrompt, finalThi
 		"created": time.Now().Unix(),
 		"model":   model,
 		"choices": []map[string]any{{"index": 0, "message": messageObj, "finish_reason": finishReason}},
-		"usage":   BuildChatUsage(finalPrompt, finalThinking, finalText),
+		"usage":   BuildChatUsageForModel(model, finalPrompt, finalThinking, finalText, 0),
 	}
 }

--- a/internal/format/openai/render_responses.go
+++ b/internal/format/openai/render_responses.go
@@ -70,7 +70,7 @@ func BuildResponseObjectFromItems(responseID, model, finalPrompt, finalThinking,
 		"model":       model,
 		"output":      output,
 		"output_text": outputText,
-		"usage":       BuildResponsesUsage(finalPrompt, finalThinking, finalText),
+		"usage":       BuildResponsesUsageForModel(model, finalPrompt, finalThinking, finalText, 0),
 	}
 }

--- a/internal/format/openai/render_test.go
+++ b/internal/format/openai/render_test.go
@@ -6,6 +6,7 @@ import (
 	"testing"

 	"ds2api/internal/toolcall"
+	"ds2api/internal/util"
 )

 func TestBuildResponseObjectKeepsFencedToolPayloadAsText(t *testing.T) {
@@ -177,3 +178,17 @@ func TestBuildResponseObjectWithToolCallsCoercesSchemaDeclaredStringArguments(t
 		t.Fatalf("expected response content stringified by schema, got %#v", args["content"])
 	}
 }
+
+func TestBuildChatUsageForModelUsesConservativePromptCount(t *testing.T) {
+	prompt := strings.Repeat("上下文token ", 40)
+	usage := BuildChatUsageForModel("deepseek-v4-flash", prompt, "", "ok", 0)
+	promptTokens, _ := usage["prompt_tokens"].(int)
+	if promptTokens <= util.EstimateTokens(prompt) {
+		t.Fatalf("expected conservative prompt token count > rough estimate, got=%d estimate=%d", promptTokens, util.EstimateTokens(prompt))
+	}
+	totalTokens, _ := usage["total_tokens"].(int)
+	completionTokens, _ := usage["completion_tokens"].(int)
+	if totalTokens != promptTokens+completionTokens {
+		t.Fatalf("expected total tokens to add up, got usage=%#v", usage)
+	}
+}
--- a/internal/format/openai/render_usage.go
+++ b/internal/format/openai/render_usage.go
@@ -2,10 +2,10 @@ package openai

 import "ds2api/internal/util"

-func BuildChatUsage(finalPrompt, finalThinking, finalText string) map[string]any {
-	promptTokens := util.EstimateTokens(finalPrompt)
-	reasoningTokens := util.EstimateTokens(finalThinking)
-	completionTokens := util.EstimateTokens(finalText)
+func BuildChatUsageForModel(model, finalPrompt, finalThinking, finalText string, refFileTokens int) map[string]any {
+	promptTokens := util.CountPromptTokens(finalPrompt, model) + refFileTokens
+	reasoningTokens := util.CountOutputTokens(finalThinking, model)
+	completionTokens := util.CountOutputTokens(finalText, model)
 	return map[string]any{
 		"prompt_tokens":     promptTokens,
 		"completion_tokens": reasoningTokens + completionTokens,
@@ -16,13 +16,21 @@ func BuildChatUsage(finalPrompt, finalThinking, finalText string) map[string]any
 	}
 }

-func BuildResponsesUsage(finalPrompt, finalThinking, finalText string) map[string]any {
-	promptTokens := util.EstimateTokens(finalPrompt)
-	reasoningTokens := util.EstimateTokens(finalThinking)
-	completionTokens := util.EstimateTokens(finalText)
+func BuildChatUsage(finalPrompt, finalThinking, finalText string) map[string]any {
+	return BuildChatUsageForModel("", finalPrompt, finalThinking, finalText, 0)
+}
+
+func BuildResponsesUsageForModel(model, finalPrompt, finalThinking, finalText string, refFileTokens int) map[string]any {
+	promptTokens := util.CountPromptTokens(finalPrompt, model) + refFileTokens
+	reasoningTokens := util.CountOutputTokens(finalThinking, model)
+	completionTokens := util.CountOutputTokens(finalText, model)
 	return map[string]any{
 		"input_tokens":  promptTokens,
 		"output_tokens": reasoningTokens + completionTokens,
 		"total_tokens":  promptTokens + reasoningTokens + completionTokens,
 	}
 }
+
+func BuildResponsesUsage(finalPrompt, finalThinking, finalText string) map[string]any {
+	return BuildResponsesUsageForModel("", finalPrompt, finalThinking, finalText, 0)
+}
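Note on the usage arithmetic in render_usage.go above: reference-file tokens are added to the prompt side only, and the output side is the sum of reasoning and completion tokens. The sketch below mirrors the Responses-style map. `tokenCounter` and `roughCount` are hypothetical stand-ins, since the real `util.CountPromptTokens` and `util.CountOutputTokens` are not shown in this excerpt.

```go
package main

import "fmt"

// tokenCounter is a hypothetical stand-in for the model-aware counters.
type tokenCounter func(text, model string) int

// roughCount is illustrative only: about one token per four runes.
func roughCount(text, _ string) int {
	return (len([]rune(text)) + 3) / 4
}

// buildUsage mirrors the shape of the Responses usage map in the diff.
func buildUsage(count tokenCounter, model, prompt, thinking, text string, refFileTokens int) map[string]any {
	input := count(prompt, model) + refFileTokens
	reasoning := count(thinking, model)
	completion := count(text, model)
	return map[string]any{
		"input_tokens":  input,
		"output_tokens": reasoning + completion,
		"total_tokens":  input + reasoning + completion,
	}
}

func main() {
	usage := buildUsage(roughCount, "example-model", "12345678", "1234", "12", 10)
	fmt.Println(usage["input_tokens"], usage["output_tokens"], usage["total_tokens"])
}
```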
--- a/internal/httpapi/admin/accounts/handler_accounts_testing.go
+++ b/internal/httpapi/admin/accounts/handler_accounts_testing.go
@@ -107,6 +107,7 @@ func (h *Handler) testAccount(ctx context.Context, acc config.Account, model, me
 		"model":           model,
 		"session_count":   0,
 		"config_writable": !h.Store.IsEnvBacked(),
+		"config_warning":  "",
 	}
 	defer func() {
 		status := "failed"
@@ -121,8 +122,7 @@ func (h *Handler) testAccount(ctx context.Context, acc config.Account, model, me
 		return result
 	}
 	if err := h.Store.UpdateAccountToken(acc.Identifier(), token); err != nil {
-		result["message"] = "登录成功但写入运行时 token 失败: " + err.Error()
-		return result
+		result["config_warning"] = "登录成功，但 token 持久化失败（仅保存在内存，重启后会丢失）: " + err.Error()
 	}
 	authCtx := &authn.RequestAuth{UseConfigToken: false, DeepSeekToken: token, AccountID: identifier, Account: acc}
 	proxyCtx := authn.WithAuth(ctx, authCtx)
@@ -136,8 +136,7 @@ func (h *Handler) testAccount(ctx context.Context, acc config.Account, model, me
 		token = newToken
 		authCtx.DeepSeekToken = token
 		if err := h.Store.UpdateAccountToken(acc.Identifier(), token); err != nil {
-			result["message"] = "刷新 token 成功但写入运行时 token 失败: " + err.Error()
-			return result
+			result["config_warning"] = "刷新 token 成功，但 token 持久化失败（仅保存在内存，重启后会丢失）: " + err.Error()
 		}
 		sessionID, err = h.DS.CreateSession(proxyCtx, authCtx, 1)
 		if err != nil {
@@ -155,6 +154,9 @@ func (h *Handler) testAccount(ctx context.Context, acc config.Account, model, me
 	if strings.TrimSpace(message) == "" {
 		result["success"] = true
 		result["message"] = "Token 刷新成功（登录与会话创建成功）"
+		if warning, _ := result["config_warning"].(string); strings.TrimSpace(warning) != "" {
+			result["message"] = result["message"].(string) + "；" + warning
+		}
 		result["response_time"] = int(time.Since(start).Milliseconds())
 		return result
 	}
--- a/internal/httpapi/admin/rawsamples/handler_raw_samples.go
+++ b/internal/httpapi/admin/rawsamples/handler_raw_samples.go
@@ -15,6 +15,7 @@ import (
 	"ds2api/internal/devcapture"
 	adminshared "ds2api/internal/httpapi/admin/shared"
 	"ds2api/internal/rawsample"
+	"ds2api/internal/util"
 )

 type captureChain struct {
@@ -479,10 +480,13 @@ func previewCaptureChainResponse(chain captureChain) string {

 func previewText(text string, limit int) string {
 	text = strings.TrimSpace(text)
-	if limit <= 0 || len(text) <= limit {
+	if limit <= 0 {
 		return text
 	}
-	return text[:limit] + "..."
+	if truncated, ok := util.TruncateRunes(text, limit); ok {
+		return truncated + "..."
+	}
+	return text
 }

 func captureChainHasTruncatedResponse(chain captureChain) bool {
--- a/internal/httpapi/admin/rawsamples/handler_raw_samples_test.go
+++ b/internal/httpapi/admin/rawsamples/handler_raw_samples_test.go
@@ -10,6 +10,7 @@ import (
 	"path/filepath"
 	"strings"
 	"testing"
+	"unicode/utf8"

 	"ds2api/internal/devcapture"
 )
@@ -231,6 +232,16 @@ func TestCombineCaptureBodiesPreservesOrderAndSeparators(t *testing.T) {
 	}
 }

+func TestPreviewTextPreservesUTF8MB4Characters(t *testing.T) {
+	preview := previewText(strings.Repeat("😀", 281), 280)
+	if !utf8.ValidString(preview) {
+		t.Fatalf("expected valid utf-8 preview, got %q", preview)
+	}
+	if preview != strings.Repeat("😀", 280)+"..." {
+		t.Fatalf("unexpected preview: %q", preview)
+	}
+}
+
 func TestQueryRawSampleCapturesGroupsBySessionAndMatchesQuestion(t *testing.T) {
 	devcapture.Global().Clear()
 	defer devcapture.Global().Clear()
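Note on the previewText change above: slicing with `text[:limit]` counts bytes, so a limit that falls inside a multi-byte character produces invalid UTF-8. The new code counts runes instead. `util.TruncateRunes` itself is not shown in this excerpt. The following sketch, with the hypothetical name `truncateRunes`, shows one way to implement the same contract (truncated text plus a flag saying whether anything was cut).

```go
package main

import "fmt"

// truncateRunes (hypothetical) keeps at most limit runes of s. The bool is
// true only when s was longer than limit and had to be cut.
func truncateRunes(s string, limit int) (string, bool) {
	if limit <= 0 {
		return "", s != ""
	}
	count := 0
	// Ranging over a string yields the byte offset of each rune start.
	for i := range s {
		if count == limit {
			return s[:i], true
		}
		count++
	}
	return s, false
}

func main() {
	preview, cut := truncateRunes("😀😀😀", 2)
	if cut {
		preview += "..."
	}
	fmt.Printf("%q\n", preview)
}
```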
--- a/internal/httpapi/claude/handler_helpers_misc.go
+++ b/internal/httpapi/claude/handler_helpers_misc.go
@@ -1,6 +1,7 @@
 package claude

 import (
+	"ds2api/internal/toolcall"
 	"fmt"
 	"strings"
 )
@@ -31,30 +32,9 @@ func extractClaudeToolNames(tools []any) []string {
 }

 func extractClaudeToolMeta(m map[string]any) (string, string, any) {
-	name, _ := m["name"].(string)
-	desc, _ := m["description"].(string)
-	schemaObj := m["input_schema"]
-	if schemaObj == nil {
-		schemaObj = m["parameters"]
-	}
-
-	if fn, ok := m["function"].(map[string]any); ok {
-		if strings.TrimSpace(name) == "" {
-			name, _ = fn["name"].(string)
-		}
-		if strings.TrimSpace(desc) == "" {
-			desc, _ = fn["description"].(string)
-		}
-		if schemaObj == nil {
-			if v, ok := fn["input_schema"]; ok {
-				schemaObj = v
-			}
-		}
-		if schemaObj == nil {
-			if v, ok := fn["parameters"]; ok {
-				schemaObj = v
-			}
-		}
+	name, desc, schemaObj := toolcall.ExtractToolMeta(m)
+	if strings.TrimSpace(desc) == "" {
+		desc = "No description available"
 	}
 	return strings.TrimSpace(name), strings.TrimSpace(desc), schemaObj
 }
--- a/internal/httpapi/claude/handler_messages.go
+++ b/internal/httpapi/claude/handler_messages.go
@@ -177,7 +177,7 @@ func stripClaudeThinkingBlocks(raw []byte) []byte {
 	return out
 }

-func (h *Handler) handleClaudeStreamRealtime(w http.ResponseWriter, r *http.Request, resp *http.Response, model string, messages []any, thinkingEnabled, searchEnabled bool, toolNames []string) {
+func (h *Handler) handleClaudeStreamRealtime(w http.ResponseWriter, r *http.Request, resp *http.Response, model string, messages []any, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any) {
 	defer func() { _ = resp.Body.Close() }()
 	if resp.StatusCode != http.StatusOK {
 		body, _ := io.ReadAll(resp.Body)
@@ -205,6 +205,8 @@ func (h *Handler) handleClaudeStreamRealtime(w http.ResponseWriter, r *http.Requ
 		searchEnabled,
 		h.compatStripReferenceMarkers(),
 		toolNames,
+		toolsRaw,
+		buildClaudePromptTokenText(messages, thinkingEnabled),
 	)
 	streamRuntime.sendMessageStart()

--- a/internal/httpapi/claude/handler_stream_test.go
+++ b/internal/httpapi/claude/handler_stream_test.go
@@ -81,7 +81,7 @@ func TestHandleClaudeStreamRealtimeTextIncrementsWithEventHeaders(t *testing.T)
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/anthropic/v1/messages", nil)

-	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "hi"}}, false, false, nil)
+	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "hi"}}, false, false, nil, nil)

 	body := rec.Body.String()
 	if !strings.Contains(body, "event: message_start") {
@@ -122,7 +122,7 @@ func TestHandleClaudeStreamRealtimeThinkingDelta(t *testing.T) {
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/anthropic/v1/messages", nil)

-	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "hi"}}, true, false, nil)
+	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "hi"}}, true, false, nil, nil)

 	frames := parseClaudeFrames(t, rec.Body.String())
 	foundThinkingDelta := false
@@ -149,7 +149,7 @@ func TestHandleClaudeStreamRealtimeSkipsThinkingFallbackWhenFinalTextExists(t *t
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/anthropic/v1/messages", nil)

-	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "use tool"}}, true, false, []string{"search"})
+	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "use tool"}}, true, false, []string{"search"}, nil)

 	frames := parseClaudeFrames(t, rec.Body.String())
 	for _, f := range findClaudeFrames(frames, "content_block_start") {
@@ -180,7 +180,7 @@ func TestHandleClaudeStreamRealtimeUpstreamErrorEvent(t *testing.T) {
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/anthropic/v1/messages", nil)

-	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "hi"}}, false, false, nil)
+	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "hi"}}, false, false, nil, nil)

 	frames := parseClaudeFrames(t, rec.Body.String())
 	errFrames := findClaudeFrames(frames, "error")
@@ -217,7 +217,7 @@ func TestHandleClaudeStreamRealtimePingEvent(t *testing.T) {

 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/anthropic/v1/messages", nil)
-	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "hi"}}, false, false, nil)
+	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "hi"}}, false, false, nil, nil)

 	frames := parseClaudeFrames(t, rec.Body.String())
 	if len(findClaudeFrames(frames, "ping")) == 0 {
@@ -271,7 +271,7 @@ func TestHandleClaudeStreamRealtimeToolSafetyAcrossStructuredFormats(t *testing.
 			rec := httptest.NewRecorder()
 			req := httptest.NewRequest(http.MethodPost, "/anthropic/v1/messages", nil)

-			h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "use tool"}}, false, false, []string{"Bash"})
+			h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "use tool"}}, false, false, []string{"Bash"}, nil)

 			frames := parseClaudeFrames(t, rec.Body.String())
 			foundToolUse := false
@@ -299,7 +299,7 @@ func TestHandleClaudeStreamRealtimeDetectsToolUseWithLeadingProse(t *testing.T)
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/anthropic/v1/messages", nil)

-	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "use tool"}}, false, false, []string{"write_file"})
+	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "use tool"}}, false, false, []string{"write_file"}, nil)

 	frames := parseClaudeFrames(t, rec.Body.String())
 	foundToolUse := false
@@ -333,7 +333,7 @@ func TestHandleClaudeStreamRealtimeIgnoresUnclosedFencedToolExample(t *testing.T
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/anthropic/v1/messages", nil)

-	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "show example only"}}, false, false, []string{"Bash"})
+	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "show example only"}}, false, false, []string{"Bash"}, nil)

 	frames := parseClaudeFrames(t, rec.Body.String())
 	foundToolUse := false
@@ -365,3 +365,48 @@ func TestHandleClaudeStreamRealtimeIgnoresUnclosedFencedToolExample(t *testing.T
 func TestHandleClaudeStreamRealtimePromotesUnclosedFencedToolExample(t *testing.T) {
 	TestHandleClaudeStreamRealtimeIgnoresUnclosedFencedToolExample(t)
 }
+
+func TestHandleClaudeStreamRealtimeNormalizesToolInputBySchema(t *testing.T) {
+	h := &Handler{}
+	resp := makeClaudeSSEHTTPResponse(
+		`data: {"p":"response/content","v":"<tool_calls><invoke name=\"Write\">{\"input\":{\"content\":{\"message\":\"hi\"},\"taskId\":1}}</invoke></tool_calls>"}`,
+		`data: [DONE]`,
+	)
+	rec := httptest.NewRecorder()
+	req := httptest.NewRequest(http.MethodPost, "/anthropic/v1/messages", nil)
+	toolsRaw := []any{
+		map[string]any{
+			"name": "Write",
+			"inputSchema": map[string]any{
+				"type": "object",
+				"properties": map[string]any{
+					"content": map[string]any{"type": "string"},
+					"taskId":  map[string]any{"type": "string"},
+				},
+			},
+		},
+	}
+
+	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "write"}}, false, false, []string{"Write"}, toolsRaw)
+
+	frames := parseClaudeFrames(t, rec.Body.String())
+	for _, f := range findClaudeFrames(frames, "content_block_delta") {
+		delta, _ := f.Payload["delta"].(map[string]any)
+		if delta["type"] != "input_json_delta" {
+			continue
+		}
+		partial := asString(delta["partial_json"])
+		var args map[string]any
+		if err := json.Unmarshal([]byte(partial), &args); err != nil {
+			t.Fatalf("decode partial_json failed: %v payload=%s", err, partial)
+		}
+		if args["content"] != `{"message":"hi"}` {
+			t.Fatalf("expected content normalized to string, got %#v", args["content"])
+		}
+		if args["taskId"] != "1" {
+			t.Fatalf("expected taskId normalized to string, got %#v", args["taskId"])
+		}
+		return
+	}
+	t.Fatalf("expected input_json_delta frame, body=%s", rec.Body.String())
+}
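Note on the schema normalization tested above: the test expects arguments whose schema type is `string` to arrive as strings, so an object becomes its JSON text and the number `1` becomes `"1"`. The real `toolcall.NormalizeParsedToolCallsForSchemas` is not shown in this excerpt. The sketch below, using the hypothetical helper `coerceStringArgs`, illustrates only that one coercion rule for top-level properties.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// coerceStringArgs (hypothetical) returns a copy of args in which every
// non-string, non-nil value whose schema property declares type "string"
// is replaced by its JSON encoding.
func coerceStringArgs(args, schema map[string]any) map[string]any {
	props, _ := schema["properties"].(map[string]any)
	out := make(map[string]any, len(args))
	for key, value := range args {
		out[key] = value
		prop, _ := props[key].(map[string]any)
		if prop["type"] != "string" || value == nil {
			continue
		}
		if _, isString := value.(string); isString {
			continue
		}
		if encoded, err := json.Marshal(value); err == nil {
			out[key] = string(encoded)
		}
	}
	return out
}

func main() {
	schema := map[string]any{
		"type": "object",
		"properties": map[string]any{
			"content": map[string]any{"type": "string"},
			"taskId":  map[string]any{"type": "string"},
		},
	}
	var args map[string]any
	raw := `{"content":{"message":"hi"},"taskId":1}`
	if err := json.Unmarshal([]byte(raw), &args); err != nil {
		panic(err)
	}
	fixed := coerceStringArgs(args, schema)
	fmt.Printf("%q %q\n", fixed["content"], fixed["taskId"])
}
```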
--- a/internal/httpapi/claude/handler_tokens.go
+++ b/internal/httpapi/claude/handler_tokens.go
@@ -3,8 +3,6 @@ package claude
 import (
 	"encoding/json"
 	"net/http"
-
-	"ds2api/internal/util"
 )

 func (h *Handler) CountTokens(w http.ResponseWriter, r *http.Request) {
@@ -26,26 +24,11 @@ func (h *Handler) CountTokens(w http.ResponseWriter, r *http.Request) {
 		writeClaudeError(w, http.StatusBadRequest, "Request must include 'model' and 'messages'.")
 		return
 	}
-	inputTokens := 0
-	if sys, ok := req["system"].(string); ok {
-		inputTokens += util.EstimateTokens(sys)
-	}
-	for _, item := range messages {
-		msg, ok := item.(map[string]any)
-		if !ok {
-			continue
-		}
-		inputTokens += 2
-		inputTokens += util.EstimateTokens(extractMessageContent(msg["content"]))
-	}
-	if tools, ok := req["tools"].([]any); ok {
-		for _, t := range tools {
-			b, _ := json.Marshal(t)
-			inputTokens += util.EstimateTokens(string(b))
-		}
-	}
-	if inputTokens < 1 {
-		inputTokens = 1
+	normalized, err := normalizeClaudeRequest(h.Store, req)
+	if err != nil {
+		writeClaudeError(w, http.StatusBadRequest, err.Error())
+		return
 	}
+	inputTokens := countClaudeInputTokens(normalized.Standard)
 	writeJSON(w, http.StatusOK, map[string]any{"input_tokens": inputTokens})
 }
--- a/internal/httpapi/claude/prompt_token_text.go
+++ b/internal/httpapi/claude/prompt_token_text.go
@@ -0,0 +1,7 @@
+package claude
+
+import "ds2api/internal/prompt"
+
+func buildClaudePromptTokenText(messages []any, thinkingEnabled bool) string {
+	return prompt.MessagesPrepareWithThinking(toMessageMaps(messages), thinkingEnabled)
+}
--- a/internal/httpapi/claude/standard_request.go
+++ b/internal/httpapi/claude/standard_request.go
@@ -48,16 +48,18 @@ func normalizeClaudeRequest(store ConfigReader, req map[string]any) (claudeNorma

 	return claudeNormalizedRequest{
 		Standard: promptcompat.StandardRequest{
-			Surface:        "anthropic_messages",
-			RequestedModel: strings.TrimSpace(model),
-			ResolvedModel:  dsModel,
-			ResponseModel:  strings.TrimSpace(model),
-			Messages:       payload["messages"].([]any),
-			FinalPrompt:    finalPrompt,
-			ToolNames:      toolNames,
-			Stream:         util.ToBool(req["stream"]),
-			Thinking:       thinkingEnabled,
-			Search:         searchEnabled,
+			Surface:         "anthropic_messages",
+			RequestedModel:  strings.TrimSpace(model),
+			ResolvedModel:   dsModel,
+			ResponseModel:   strings.TrimSpace(model),
+			Messages:        payload["messages"].([]any),
+			PromptTokenText: finalPrompt,
+			ToolsRaw:        toolsRequested,
+			FinalPrompt:     finalPrompt,
+			ToolNames:       toolNames,
+			Stream:          util.ToBool(req["stream"]),
+			Thinking:        thinkingEnabled,
+			Search:          searchEnabled,
 		},
 		NormalizedMessages: normalizedMessages,
 	}, nil
--- a/internal/httpapi/claude/standard_request_test.go
+++ b/internal/httpapi/claude/standard_request_test.go
@@ -32,11 +32,39 @@ func TestNormalizeClaudeRequest(t *testing.T) {
 	if len(norm.Standard.ToolNames) == 0 {
 		t.Fatalf("expected tool names")
 	}
+	if norm.Standard.ToolsRaw == nil {
+		t.Fatalf("expected ToolsRaw preserved for downstream normalization")
+	}
 	if norm.Standard.FinalPrompt == "" {
 		t.Fatalf("expected non-empty final prompt")
 	}
 }

+func TestNormalizeClaudeRequestSupportsCamelCaseInputSchemaPromptInjection(t *testing.T) {
+	t.Setenv("DS2API_CONFIG_JSON", `{}`)
+	store := config.LoadStore()
+	req := map[string]any{
+		"model": "claude-sonnet-4-5",
+		"messages": []any{
+			map[string]any{"role": "user", "content": "hello"},
+		},
+		"tools": []any{
+			map[string]any{
+				"name":        "todowrite",
+				"description": "Write todos",
+				"inputSchema": map[string]any{"type": "object", "properties": map[string]any{"todos": map[string]any{"type": "array"}}},
+			},
+		},
+	}
+	norm, err := normalizeClaudeRequest(store, req)
+	if err != nil {
+		t.Fatalf("normalize failed: %v", err)
+	}
+	if !containsStr(norm.Standard.FinalPrompt, `"type":"array"`) {
+		t.Fatalf("expected inputSchema to be injected into prompt, got=%q", norm.Standard.FinalPrompt)
+	}
+}
+
 func TestNormalizeClaudeRequestInjectsToolsIntoExistingSystemMessage(t *testing.T) {
 	t.Setenv("DS2API_CONFIG_JSON", `{}`)
 	store := config.LoadStore()
--- a/internal/httpapi/claude/stream_runtime_core.go
+++ b/internal/httpapi/claude/stream_runtime_core.go
@@ -15,9 +15,11 @@ type claudeStreamRuntime struct {
 	rc       *http.ResponseController
 	canFlush bool

-	model     string
-	toolNames []string
-	messages  []any
+	model           string
+	toolNames       []string
+	messages        []any
+	toolsRaw        any
+	promptTokenText string

 	thinkingEnabled       bool
 	searchEnabled         bool
@@ -47,6 +49,8 @@ func newClaudeStreamRuntime(
 	searchEnabled bool,
 	stripReferenceMarkers bool,
 	toolNames []string,
+	toolsRaw any,
+	promptTokenText string,
 ) *claudeStreamRuntime {
 	return &claudeStreamRuntime{
 		w:                     w,
@@ -59,6 +63,8 @@ func newClaudeStreamRuntime(
 		bufferToolContent:     len(toolNames) > 0,
 		stripReferenceMarkers: stripReferenceMarkers,
 		toolNames:             toolNames,
+		toolsRaw:              toolsRaw,
+		promptTokenText:       promptTokenText,
 		messageID:             fmt.Sprintf("msg_%d", time.Now().UnixNano()),
 		thinkingBlockIndex:    -1,
 		textBlockIndex:        -1,
--- a/internal/httpapi/claude/stream_runtime_emit.go
+++ b/internal/httpapi/claude/stream_runtime_emit.go
@@ -42,7 +42,10 @@ func (s *claudeStreamRuntime) sendPing() {
 }

 func (s *claudeStreamRuntime) sendMessageStart() {
-	inputTokens := util.EstimateTokens(fmt.Sprintf("%v", s.messages))
+	inputTokens := countClaudeInputTokensFromText(s.promptTokenText, s.model)
+	if inputTokens == 0 {
+		inputTokens = util.CountPromptTokens(fmt.Sprintf("%v", s.messages), s.model)
+	}
 	s.send("message_start", map[string]any{
 		"type": "message_start",
 		"message": map[string]any{
--- a/internal/httpapi/claude/stream_runtime_finalize.go
+++ b/internal/httpapi/claude/stream_runtime_finalize.go
@@ -52,6 +52,7 @@ func (s *claudeStreamRuntime) finalize(stopReason string) {
 			detected = toolcall.ParseStandaloneToolCalls(finalThinking, s.toolNames)
 		}
 		if len(detected) > 0 {
+			detected = toolcall.NormalizeParsedToolCallsForSchemas(detected, s.toolsRaw)
 			stopReason = "tool_use"
 			for i, tc := range detected {
 				idx := s.nextBlockIndex + i
@@ -108,7 +109,7 @@ func (s *claudeStreamRuntime) finalize(stopReason string) {
 		}
 	}

-	outputTokens := util.EstimateTokens(finalThinking) + util.EstimateTokens(finalText)
+	outputTokens := util.CountOutputTokens(finalThinking, s.model) + util.CountOutputTokens(finalText, s.model)
 	s.send("message_delta", map[string]any{
 		"type": "message_delta",
 		"delta": map[string]any{
--- a/internal/httpapi/claude/token_count.go
+++ b/internal/httpapi/claude/token_count.go
@@ -0,0 +1,20 @@
+package claude
+
+import (
+	"strings"
+
+	"ds2api/internal/promptcompat"
+	"ds2api/internal/util"
+)
+
+func countClaudeInputTokens(stdReq promptcompat.StandardRequest) int {
+	promptText := stdReq.PromptTokenText
+	if strings.TrimSpace(promptText) == "" {
+		promptText = stdReq.FinalPrompt
+	}
+	return countClaudeInputTokensFromText(promptText, stdReq.ResolvedModel)
+}
+
+func countClaudeInputTokensFromText(promptText, model string) int {
+	return util.CountPromptTokens(promptText, model)
+}
--- a/internal/httpapi/gemini/convert_request.go
+++ b/internal/httpapi/gemini/convert_request.go
@@ -36,16 +36,17 @@ func normalizeGeminiRequest(store ConfigReader, routeModel string, req map[strin
 	passThrough := collectGeminiPassThrough(req)

 	return promptcompat.StandardRequest{
-		Surface:        "google_gemini",
-		RequestedModel: requestedModel,
-		ResolvedModel:  resolvedModel,
-		ResponseModel:  requestedModel,
-		Messages:       messagesRaw,
-		FinalPrompt:    finalPrompt,
-		ToolNames:      toolNames,
-		Stream:         stream,
-		Thinking:       thinkingEnabled,
-		Search:         searchEnabled,
-		PassThrough:    passThrough,
+		Surface:         "google_gemini",
+		RequestedModel:  requestedModel,
+		ResolvedModel:   resolvedModel,
+		ResponseModel:   requestedModel,
+		Messages:        messagesRaw,
+		PromptTokenText: finalPrompt,
+		FinalPrompt:     finalPrompt,
+		ToolNames:       toolNames,
+		Stream:          stream,
+		Thinking:        thinkingEnabled,
+		Search:          searchEnabled,
+		PassThrough:     passThrough,
 	}, nil
 }
--- a/internal/httpapi/gemini/handler_generate.go
+++ b/internal/httpapi/gemini/handler_generate.go
@@ -227,7 +227,7 @@ func (h *Handler) handleNonStreamGenerateContent(w http.ResponseWriter, resp *ht
 //nolint:unused // retained for native Gemini non-stream handling path.
 func buildGeminiGenerateContentResponse(model, finalPrompt, finalThinking, finalText string, toolNames []string) map[string]any {
 	parts := buildGeminiPartsFromFinal(finalText, finalThinking, toolNames)
-	usage := buildGeminiUsage(finalPrompt, finalThinking, finalText)
+	usage := buildGeminiUsage(model, finalPrompt, finalThinking, finalText)
 	return map[string]any{
 		"candidates": []map[string]any{
 			{
@@ -245,10 +245,10 @@ func buildGeminiGenerateContentResponse(model, finalPrompt, finalThinking, final
 }

 //nolint:unused // retained for native Gemini non-stream handling path.
-func buildGeminiUsage(finalPrompt, finalThinking, finalText string) map[string]any {
-	promptTokens := util.EstimateTokens(finalPrompt)
-	reasoningTokens := util.EstimateTokens(finalThinking)
-	completionTokens := util.EstimateTokens(finalText)
+func buildGeminiUsage(model, finalPrompt, finalThinking, finalText string) map[string]any {
+	promptTokens := util.CountPromptTokens(finalPrompt, model)
+	reasoningTokens := util.CountOutputTokens(finalThinking, model)
+	completionTokens := util.CountOutputTokens(finalText, model)
 	return map[string]any{
 		"promptTokenCount":     promptTokens,
 		"candidatesTokenCount": reasoningTokens + completionTokens,
--- a/internal/httpapi/gemini/handler_stream_runtime.go
+++ b/internal/httpapi/gemini/handler_stream_runtime.go
@@ -194,6 +194,6 @@ func (s *geminiStreamRuntime) finalize() {
 			},
 		},
 		"modelVersion":  s.model,
-		"usageMetadata": buildGeminiUsage(s.finalPrompt, finalThinking, finalText),
+		"usageMetadata": buildGeminiUsage(s.model, s.finalPrompt, finalThinking, finalText),
 	})
 }
--- a/internal/httpapi/openai/chat/chat_history_test.go
+++ b/internal/httpapi/openai/chat/chat_history_test.go
@@ -126,6 +126,7 @@ func TestStartChatHistoryRecoversFromTransientWriteFailure(t *testing.T) {
 	session := startChatHistory(historyStore, req, a, stdReq)
 	if session == nil {
 		t.Fatalf("expected session even when initial persistence fails")
+		return
 	}
 	if session.disabled {
 		t.Fatalf("expected session to remain active after transient start failure")
@@ -194,7 +195,7 @@ func TestHandleStreamContextCancelledMarksHistoryStopped(t *testing.T) {
 	rec := httptest.NewRecorder()
 	resp := makeOpenAISSEHTTPResponse(`data: {"p":"response/content","v":"hello"}`, `data: [DONE]`)

-	h.handleStream(rec, req, resp, "cid-stop", "deepseek-v4-flash", "prompt", false, false, nil, nil, session)
+	h.handleStream(rec, req, resp, "cid-stop", "deepseek-v4-flash", "prompt", 0, false, false, nil, nil, session)

 	snapshot, err := historyStore.Snapshot()
 	if err != nil {
@@ -307,14 +308,14 @@ func TestChatCompletionsCurrentInputFilePersistsNeutralPrompt(t *testing.T) {
 	if err != nil {
 		t.Fatalf("expected detail item, got %v", err)
 	}
-	if full.HistoryText != "" {
-		t.Fatalf("expected current input file flow to leave history text empty, got %q", full.HistoryText)
-	}
 	if len(ds.uploadCalls) != 1 {
 		t.Fatalf("expected current input upload to happen, got %d", len(ds.uploadCalls))
 	}
-	if ds.uploadCalls[0].Filename != "IGNORE.txt" {
-		t.Fatalf("expected IGNORE.txt upload, got %q", ds.uploadCalls[0].Filename)
+	if ds.uploadCalls[0].Filename != "history.txt" {
+		t.Fatalf("expected history.txt upload, got %q", ds.uploadCalls[0].Filename)
+	}
+	if full.HistoryText != string(ds.uploadCalls[0].Data) {
+		t.Fatalf("expected uploaded current input file to be persisted in history text")
 	}
 	if len(full.Messages) != 1 {
 		t.Fatalf("expected neutral prompt to be the only persisted message, got %#v", full.Messages)
--- a/internal/httpapi/openai/chat/chat_stream_runtime.go
+++ b/internal/httpapi/openai/chat/chat_stream_runtime.go
@@ -16,12 +16,13 @@ type chatStreamRuntime struct {
 	rc       *http.ResponseController
 	canFlush bool

-	completionID string
-	created      int64
-	model        string
-	finalPrompt  string
-	toolNames    []string
-	toolsRaw     any
+	completionID  string
+	created       int64
+	model         string
+	finalPrompt   string
+	refFileTokens int
+	toolNames     []string
+	toolsRaw      any

 	thinkingEnabled       bool
 	searchEnabled         bool
@@ -36,8 +37,10 @@ type chatStreamRuntime struct {
 	toolSieve             toolstream.State
 	streamToolCallIDs     map[int]string
 	streamToolNames       map[int]string
+	rawThinking           strings.Builder
 	thinking              strings.Builder
 	toolDetectionThinking strings.Builder
+	rawText               strings.Builder
 	text                  strings.Builder
 	responseMessageID     int

@@ -50,6 +53,32 @@ type chatStreamRuntime struct {
 	finalErrorCode    string
 }

+type chatDeltaBatch struct {
+	runtime *chatStreamRuntime
+	field   string
+	text    strings.Builder
+}
+
+func (b *chatDeltaBatch) append(field, text string) {
+	if text == "" {
+		return
+	}
+	if b.field != "" && b.field != field {
+		b.flush()
+	}
+	b.field = field
+	b.text.WriteString(text)
+}
+
+func (b *chatDeltaBatch) flush() {
+	if b.field == "" || b.text.Len() == 0 {
+		return
+	}
+	b.runtime.sendDelta(map[string]any{b.field: b.text.String()})
+	b.field = ""
+	b.text.Reset()
+}
+
 func newChatStreamRuntime(
 	w http.ResponseWriter,
 	rc *http.ResponseController,
@@ -104,6 +133,23 @@ func (s *chatStreamRuntime) sendChunk(v any) {
 	}
 }

+func (s *chatStreamRuntime) sendDelta(delta map[string]any) {
+	if len(delta) == 0 {
+		return
+	}
+	if !s.firstChunkSent {
+		delta["role"] = "assistant"
+		s.firstChunkSent = true
+	}
+	s.sendChunk(openaifmt.BuildChatStreamChunk(
+		s.completionID,
+		s.created,
+		s.model,
+		[]map[string]any{openaifmt.BuildChatStreamDeltaChoice(0, delta)},
+		nil,
+	))
+}
+
 func (s *chatStreamRuntime) sendDone() {
 	_, _ = s.w.Write([]byte("data: [DONE]\n\n"))
 	if s.canFlush {
@@ -141,69 +187,37 @@ func (s *chatStreamRuntime) finalize(finishReason string, deferEmptyOutput bool)
 	finalText := cleanVisibleOutput(s.text.String(), s.stripReferenceMarkers)
 	s.finalThinking = finalThinking
 	s.finalText = finalText
-	detected := detectAssistantToolCalls(finalText, finalThinking, finalToolDetectionThinking, s.toolNames)
+	detected := detectAssistantToolCalls(s.rawText.String(), finalText, s.rawThinking.String(), finalToolDetectionThinking, s.toolNames)
 	if len(detected.Calls) > 0 && !s.toolCallsDoneEmitted {
 		finishReason = "tool_calls"
-		delta := map[string]any{
+		s.sendDelta(map[string]any{
 			"tool_calls": formatFinalStreamToolCallsWithStableIDs(detected.Calls, s.streamToolCallIDs, s.toolsRaw),
-		}
-		if !s.firstChunkSent {
-			delta["role"] = "assistant"
-			s.firstChunkSent = true
-		}
-		s.sendChunk(openaifmt.BuildChatStreamChunk(
-			s.completionID,
-			s.created,
-			s.model,
-			[]map[string]any{openaifmt.BuildChatStreamDeltaChoice(0, delta)},
-			nil,
-		))
+		})
 		s.toolCallsEmitted = true
 		s.toolCallsDoneEmitted = true
 	} else if s.bufferToolContent {
+		batch := chatDeltaBatch{runtime: s}
 		for _, evt := range toolstream.Flush(&s.toolSieve, s.toolNames) {
 			if len(evt.ToolCalls) > 0 {
+				batch.flush()
 				finishReason = "tool_calls"
 				s.toolCallsEmitted = true
 				s.toolCallsDoneEmitted = true
-				tcDelta := map[string]any{
+				s.sendDelta(map[string]any{
 					"tool_calls": formatFinalStreamToolCallsWithStableIDs(evt.ToolCalls, s.streamToolCallIDs, s.toolsRaw),
-				}
-				if !s.firstChunkSent {
-					tcDelta["role"] = "assistant"
-					s.firstChunkSent = true
-				}
-				s.sendChunk(openaifmt.BuildChatStreamChunk(
-					s.completionID,
-					s.created,
-					s.model,
-					[]map[string]any{openaifmt.BuildChatStreamDeltaChoice(0, tcDelta)},
-					nil,
-				))
+				})
 				s.resetStreamToolCallState()
 			}
 			if evt.Content == "" {
 				continue
 			}
 			cleaned := cleanVisibleOutput(evt.Content, s.stripReferenceMarkers)
-			if cleaned == "" {
+			if cleaned == "" || (s.searchEnabled && sse.IsCitation(cleaned)) {
 				continue
 			}
-			delta := map[string]any{
-				"content": cleaned,
-			}
-			if !s.firstChunkSent {
-				delta["role"] = "assistant"
-				s.firstChunkSent = true
-			}
-			s.sendChunk(openaifmt.BuildChatStreamChunk(
-				s.completionID,
-				s.created,
-				s.model,
-				[]map[string]any{openaifmt.BuildChatStreamDeltaChoice(0, delta)},
-				nil,
-			))
+			batch.append("content", cleaned)
 		}
+		batch.flush()
 	}

 	if len(detected.Calls) > 0 || s.toolCallsEmitted {
@@ -220,7 +234,7 @@ func (s *chatStreamRuntime) finalize(finishReason string, deferEmptyOutput bool)
 		s.sendFailedChunk(status, message, code)
 		return true
 	}
-	usage := openaifmt.BuildChatUsage(s.finalPrompt, finalThinking, finalText)
+	usage := openaifmt.BuildChatUsageForModel(s.model, s.finalPrompt, finalThinking, finalText, s.refFileTokens)
 	s.finalFinishReason = finishReason
 	s.finalUsage = usage
 	s.sendChunk(openaifmt.BuildChatStreamChunk(
@@ -254,8 +268,8 @@ func (s *chatStreamRuntime) onParsed(parsed sse.LineResult) streamengine.ParsedD
 		return streamengine.ParsedDecision{Stop: true, StopReason: streamengine.StopReasonHandlerRequested}
 	}

-	newChoices := make([]map[string]any, 0, len(parsed.Parts))
 	contentSeen := false
+	batch := chatDeltaBatch{runtime: s}
 	for _, p := range parsed.ToolDetectionThinkingParts {
 		trimmed := sse.TrimContinuationOverlap(s.toolDetectionThinking.String(), p.Text)
 		if trimmed != "" {
@@ -263,38 +277,46 @@ func (s *chatStreamRuntime) onParsed(parsed sse.LineResult) streamengine.ParsedD
 		}
 	}
 	for _, p := range parsed.Parts {
-		cleanedText := cleanVisibleOutput(p.Text, s.stripReferenceMarkers)
-		if s.searchEnabled && sse.IsCitation(cleanedText) {
-			continue
-		}
-		if cleanedText == "" {
-			continue
-		}
-		contentSeen = true
-		delta := map[string]any{}
-		if !s.firstChunkSent {
-			delta["role"] = "assistant"
-			s.firstChunkSent = true
-		}
 		if p.Type == "thinking" {
+			rawTrimmed := sse.TrimContinuationOverlap(s.rawThinking.String(), p.Text)
+			if rawTrimmed != "" {
+				s.rawThinking.WriteString(rawTrimmed)
+				contentSeen = true
+			}
 			if s.thinkingEnabled {
+				cleanedText := cleanVisibleOutput(rawTrimmed, s.stripReferenceMarkers)
+				if cleanedText == "" {
+					continue
+				}
 				trimmed := sse.TrimContinuationOverlap(s.thinking.String(), cleanedText)
 				if trimmed == "" {
 					continue
 				}
 				s.thinking.WriteString(trimmed)
-				delta["reasoning_content"] = trimmed
+				batch.append("reasoning_content", trimmed)
 			}
 		} else {
-			trimmed := sse.TrimContinuationOverlap(s.text.String(), cleanedText)
-			if trimmed == "" {
+			rawTrimmed := sse.TrimContinuationOverlap(s.rawText.String(), p.Text)
+			if rawTrimmed == "" {
 				continue
 			}
-			s.text.WriteString(trimmed)
+			s.rawText.WriteString(rawTrimmed)
+			contentSeen = true
+			cleanedText := cleanVisibleOutput(rawTrimmed, s.stripReferenceMarkers)
+			if s.searchEnabled && sse.IsCitation(cleanedText) {
+				continue
+			}
+			trimmed := sse.TrimContinuationOverlap(s.text.String(), cleanedText)
+			if trimmed != "" {
+				s.text.WriteString(trimmed)
+			}
 			if !s.bufferToolContent {
-				delta["content"] = trimmed
+				if trimmed == "" {
+					continue
+				}
+				batch.append("content", trimmed)
 			} else {
-				events := toolstream.ProcessChunk(&s.toolSieve, trimmed, s.toolNames)
+				events := toolstream.ProcessChunk(&s.toolSieve, rawTrimmed, s.toolNames)
 				for _, evt := range events {
 					if len(evt.ToolCallDeltas) > 0 {
 						if !s.emitEarlyToolDeltas {
@@ -308,55 +330,36 @@ func (s *chatStreamRuntime) onParsed(parsed sse.LineResult) streamengine.ParsedD
 						if len(formatted) == 0 {
 							continue
 						}
+						batch.flush()
 						tcDelta := map[string]any{
 							"tool_calls": formatted,
 						}
 						s.toolCallsEmitted = true
-						if !s.firstChunkSent {
-							tcDelta["role"] = "assistant"
-							s.firstChunkSent = true
-						}
-						newChoices = append(newChoices, openaifmt.BuildChatStreamDeltaChoice(0, tcDelta))
+						s.sendDelta(tcDelta)
 						continue
 					}
 					if len(evt.ToolCalls) > 0 {
+						batch.flush()
 						s.toolCallsEmitted = true
 						s.toolCallsDoneEmitted = true
 						tcDelta := map[string]any{
 							"tool_calls": formatFinalStreamToolCallsWithStableIDs(evt.ToolCalls, s.streamToolCallIDs, s.toolsRaw),
 						}
-						if !s.firstChunkSent {
-							tcDelta["role"] = "assistant"
-							s.firstChunkSent = true
-						}
-						newChoices = append(newChoices, openaifmt.BuildChatStreamDeltaChoice(0, tcDelta))
+						s.sendDelta(tcDelta)
 						s.resetStreamToolCallState()
 						continue
 					}
 					if evt.Content != "" {
 						cleaned := cleanVisibleOutput(evt.Content, s.stripReferenceMarkers)
-						if cleaned == "" {
+						if cleaned == "" || (s.searchEnabled && sse.IsCitation(cleaned)) {
 							continue
 						}
-						contentDelta := map[string]any{
-							"content": cleaned,
-						}
-						if !s.firstChunkSent {
-							contentDelta["role"] = "assistant"
-							s.firstChunkSent = true
-						}
-						newChoices = append(newChoices, openaifmt.BuildChatStreamDeltaChoice(0, contentDelta))
+						batch.append("content", cleaned)
 					}
 				}
 			}
 		}
-		if len(delta) > 0 {
-			newChoices = append(newChoices, openaifmt.BuildChatStreamDeltaChoice(0, delta))
-		}
-	}
-
-	if len(newChoices) > 0 {
-		s.sendChunk(openaifmt.BuildChatStreamChunk(s.completionID, s.created, s.model, newChoices, nil))
 	}
+	batch.flush()
 	return streamengine.ParsedDecision{ContentSeen: contentSeen}
 }
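Review note on the `chatDeltaBatch` change above: the batch merges consecutive deltas that target the same field and flushes whenever the field changes, so each emitted frame carries a single choice with a single field. The sketch below isolates that rule under stated assumptions. `delta`, `deltaBatch`, and `coalesce` are hypothetical names, and the batch collects into a slice rather than calling `sendDelta`, so this is an illustration of the technique, not the handler's actual code path.

```go
package main

import (
	"fmt"
	"strings"
)

// delta is a hypothetical stand-in for one emitted stream frame:
// a single field name ("content", "reasoning_content") and its text.
type delta struct {
	Field string
	Text  string
}

// deltaBatch follows the shape of chatDeltaBatch in the diff, but it
// appends flushed frames to a slice instead of writing to an SSE stream.
type deltaBatch struct {
	field string
	text  strings.Builder
	out   []delta
}

// append buffers text for field. A change of field flushes the pending
// buffer first, so one frame never mixes two fields.
func (b *deltaBatch) append(field, text string) {
	if text == "" {
		return
	}
	if b.field != "" && b.field != field {
		b.flush()
	}
	b.field = field
	b.text.WriteString(text)
}

// flush emits the pending buffer as one frame and resets the batch.
func (b *deltaBatch) flush() {
	if b.field == "" || b.text.Len() == 0 {
		return
	}
	b.out = append(b.out, delta{Field: b.field, Text: b.text.String()})
	b.field = ""
	b.text.Reset()
}

// coalesce is a hypothetical helper that runs a sequence of parsed parts
// through the batch and returns the frames that would be sent.
func coalesce(parts []delta) []delta {
	var b deltaBatch
	for _, p := range parts {
		b.append(p.Field, p.Text)
	}
	b.flush()
	return b.out
}

func main() {
	frames := coalesce([]delta{
		{Field: "reasoning_content", Text: "We"},
		{Field: "reasoning_content", Text: " are"},
		{Field: "reasoning_content", Text: " asked"},
		{Field: "content", Text: "An"},
		{Field: "content", Text: "swer"},
	})
	for _, f := range frames {
		fmt.Printf("%s: %q\n", f.Field, f.Text)
	}
	// prints:
	// reasoning_content: "We are asked"
	// content: "Answer"
}
```

Five parsed parts collapse into two frames here. In the diff, tool-call deltas call `batch.flush()` before `sendDelta`, which keeps buffered text ordered ahead of the tool call.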
--- a/internal/httpapi/openai/chat/empty_retry_runtime.go
+++ b/internal/httpapi/openai/chat/empty_retry_runtime.go
@@ -16,6 +16,8 @@ import (
 )

 type chatNonStreamResult struct {
+	rawThinking           string
+	rawText               string
 	thinking              string
 	toolDetectionThinking string
 	text                  string
@@ -26,11 +28,12 @@ type chatNonStreamResult struct {
 	responseMessageID     int
 }

-func (h *Handler) handleNonStreamWithRetry(w http.ResponseWriter, ctx context.Context, a *auth.RequestAuth, resp *http.Response, payload map[string]any, pow, completionID, model, finalPrompt string, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, historySession *chatHistorySession) {
+func (h *Handler) handleNonStreamWithRetry(w http.ResponseWriter, ctx context.Context, a *auth.RequestAuth, resp *http.Response, payload map[string]any, pow, completionID, model, finalPrompt string, refFileTokens int, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, historySession *chatHistorySession) {
 	attempts := 0
 	currentResp := resp
 	usagePrompt := finalPrompt
 	accumulatedThinking := ""
+	accumulatedRawThinking := ""
 	accumulatedToolDetectionThinking := ""
 	for {
 		result, ok := h.collectChatNonStreamAttempt(w, currentResp, completionID, model, usagePrompt, thinkingEnabled, searchEnabled, toolNames, toolsRaw)
@@ -38,15 +41,18 @@ func (h *Handler) handleNonStreamWithRetry(w http.ResponseWriter, ctx context.Co
 			return
 		}
 		accumulatedThinking += sse.TrimContinuationOverlap(accumulatedThinking, result.thinking)
+		accumulatedRawThinking += sse.TrimContinuationOverlap(accumulatedRawThinking, result.rawThinking)
 		accumulatedToolDetectionThinking += sse.TrimContinuationOverlap(accumulatedToolDetectionThinking, result.toolDetectionThinking)
 		result.thinking = accumulatedThinking
+		result.rawThinking = accumulatedRawThinking
 		result.toolDetectionThinking = accumulatedToolDetectionThinking
-		detected := detectAssistantToolCalls(result.text, result.thinking, result.toolDetectionThinking, toolNames)
+		detected := detectAssistantToolCalls(result.rawText, result.text, result.rawThinking, result.toolDetectionThinking, toolNames)
 		result.detectedCalls = len(detected.Calls)
 		result.body = openaifmt.BuildChatCompletionWithToolCalls(completionID, model, usagePrompt, result.thinking, result.text, detected.Calls, toolsRaw)
+		addRefFileTokensToUsage(result.body, refFileTokens)
 		result.finishReason = chatFinishReason(result.body)
 		if !shouldRetryChatNonStream(result, attempts) {
-			h.finishChatNonStreamResult(w, result, attempts, usagePrompt, historySession)
+			h.finishChatNonStreamResult(w, result, attempts, usagePrompt, refFileTokens, historySession)
 			return
 		}

@@ -67,7 +73,7 @@ func (h *Handler) handleNonStreamWithRetry(w http.ResponseWriter, ctx context.Co
 			config.Logger.Warn("[openai_empty_retry] retry request failed", "surface", "chat.completions", "stream", false, "retry_attempt", attempts, "error", err)
 			return
 		}
-		usagePrompt = usagePromptWithEmptyOutputRetry(finalPrompt, attempts)
+		usagePrompt = usagePromptWithEmptyOutputRetry(usagePrompt, attempts)
 		currentResp = nextResp
 	}
 }
@@ -82,16 +88,17 @@ func (h *Handler) collectChatNonStreamAttempt(w http.ResponseWriter, resp *http.
 	result := sse.CollectStream(resp, thinkingEnabled, true)
 	stripReferenceMarkers := h.compatStripReferenceMarkers()
 	finalThinking := cleanVisibleOutput(result.Thinking, stripReferenceMarkers)
-	finalToolDetectionThinking := cleanVisibleOutput(result.ToolDetectionThinking, stripReferenceMarkers)
 	finalText := cleanVisibleOutput(result.Text, stripReferenceMarkers)
 	if searchEnabled {
 		finalText = replaceCitationMarkersWithLinks(finalText, result.CitationLinks)
 	}
-	detected := detectAssistantToolCalls(finalText, finalThinking, finalToolDetectionThinking, toolNames)
+	detected := detectAssistantToolCalls(result.Text, finalText, result.Thinking, result.ToolDetectionThinking, toolNames)
 	respBody := openaifmt.BuildChatCompletionWithToolCalls(completionID, model, usagePrompt, finalThinking, finalText, detected.Calls, toolsRaw)
 	return chatNonStreamResult{
+		rawThinking:           result.Thinking,
+		rawText:               result.Text,
 		thinking:              finalThinking,
-		toolDetectionThinking: finalToolDetectionThinking,
+		toolDetectionThinking: result.ToolDetectionThinking,
 		text:                  finalText,
 		contentFilter:         result.ContentFilter,
 		detectedCalls:         len(detected.Calls),
@@ -101,7 +108,7 @@ func (h *Handler) collectChatNonStreamAttempt(w http.ResponseWriter, resp *http.
 	}, true
 }

-func (h *Handler) finishChatNonStreamResult(w http.ResponseWriter, result chatNonStreamResult, attempts int, usagePrompt string, historySession *chatHistorySession) {
+func (h *Handler) finishChatNonStreamResult(w http.ResponseWriter, result chatNonStreamResult, attempts int, usagePrompt string, refFileTokens int, historySession *chatHistorySession) {
 	if result.detectedCalls == 0 && shouldWriteUpstreamEmptyOutputError(result.text) {
 		status, message, code := upstreamEmptyOutputDetail(result.contentFilter, result.text, result.thinking)
 		if historySession != nil {
@@ -112,7 +119,7 @@ func (h *Handler) finishChatNonStreamResult(w http.ResponseWriter, result chatNo
 		return
 	}
 	if historySession != nil {
-		historySession.success(http.StatusOK, result.thinking, result.text, result.finishReason, openaifmt.BuildChatUsage(usagePrompt, result.thinking, result.text))
+		historySession.success(http.StatusOK, result.thinking, result.text, result.finishReason, openaifmt.BuildChatUsageForModel("", usagePrompt, result.thinking, result.text, refFileTokens))
 	}
 	writeJSON(w, http.StatusOK, result.body)
 	source := "first_attempt"
@@ -139,8 +146,8 @@ func shouldRetryChatNonStream(result chatNonStreamResult, attempts int) bool {
 		strings.TrimSpace(result.text) == ""
 }

-func (h *Handler) handleStreamWithRetry(w http.ResponseWriter, r *http.Request, a *auth.RequestAuth, resp *http.Response, payload map[string]any, pow, completionID, model, finalPrompt string, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, historySession *chatHistorySession) {
-	streamRuntime, initialType, ok := h.prepareChatStreamRuntime(w, resp, completionID, model, finalPrompt, thinkingEnabled, searchEnabled, toolNames, toolsRaw, historySession)
+func (h *Handler) handleStreamWithRetry(w http.ResponseWriter, r *http.Request, a *auth.RequestAuth, resp *http.Response, payload map[string]any, pow, completionID, model, finalPrompt string, refFileTokens int, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, historySession *chatHistorySession) {
+	streamRuntime, initialType, ok := h.prepareChatStreamRuntime(w, resp, completionID, model, finalPrompt, refFileTokens, thinkingEnabled, searchEnabled, toolNames, toolsRaw, historySession)
 	if !ok {
 		return
 	}
@@ -182,7 +189,7 @@ func (h *Handler) handleStreamWithRetry(w http.ResponseWriter, r *http.Request,
 	}
 }

-func (h *Handler) prepareChatStreamRuntime(w http.ResponseWriter, resp *http.Response, completionID, model, finalPrompt string, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, historySession *chatHistorySession) (*chatStreamRuntime, string, bool) {
+func (h *Handler) prepareChatStreamRuntime(w http.ResponseWriter, resp *http.Response, completionID, model, finalPrompt string, refFileTokens int, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, historySession *chatHistorySession) (*chatStreamRuntime, string, bool) {
 	if resp.StatusCode != http.StatusOK {
 		defer func() { _ = resp.Body.Close() }()
 		body, _ := io.ReadAll(resp.Body)
@@ -210,6 +217,7 @@ func (h *Handler) prepareChatStreamRuntime(w http.ResponseWriter, resp *http.Res
 		thinkingEnabled, searchEnabled, h.compatStripReferenceMarkers(), toolNames, toolsRaw,
 		len(toolNames) > 0, h.toolcallFeatureMatchEnabled() && h.toolcallEarlyEmitHighConfidence(),
 	)
+	streamRuntime.refFileTokens = refFileTokens
 	return streamRuntime, initialType, true
 }

--- a/internal/httpapi/openai/chat/handler.go
+++ b/internal/httpapi/openai/chat/handler.go
@@ -148,6 +148,6 @@ func formatFinalStreamToolCallsWithStableIDs(calls []toolcall.ParsedToolCall, id
 	return shared.FormatFinalStreamToolCallsWithStableIDs(calls, ids, toolsRaw)
 }

-func detectAssistantToolCalls(text, exposedThinking, detectionThinking string, toolNames []string) toolcall.ToolCallParseResult {
-	return shared.DetectAssistantToolCalls(text, exposedThinking, detectionThinking, toolNames)
+func detectAssistantToolCalls(rawText, visibleText, exposedThinking, detectionThinking string, toolNames []string) toolcall.ToolCallParseResult {
+	return shared.DetectAssistantToolCalls(rawText, visibleText, exposedThinking, detectionThinking, toolNames)
 }
--- a/internal/httpapi/openai/chat/handler_chat.go
+++ b/internal/httpapi/openai/chat/handler_chat.go
@@ -108,11 +108,12 @@ func (h *Handler) ChatCompletions(w http.ResponseWriter, r *http.Request) {
 		writeOpenAIError(w, http.StatusInternalServerError, "Failed to get completion.")
 		return
 	}
+	refFileTokens := stdReq.RefFileTokens
 	if stdReq.Stream {
-		h.handleStreamWithRetry(w, r, a, resp, payload, pow, sessionID, stdReq.ResponseModel, stdReq.FinalPrompt, stdReq.Thinking, stdReq.Search, stdReq.ToolNames, stdReq.ToolsRaw, historySession)
+		h.handleStreamWithRetry(w, r, a, resp, payload, pow, sessionID, stdReq.ResponseModel, stdReq.PromptTokenText, refFileTokens, stdReq.Thinking, stdReq.Search, stdReq.ToolNames, stdReq.ToolsRaw, historySession)
 		return
 	}
-	h.handleNonStreamWithRetry(w, r.Context(), a, resp, payload, pow, sessionID, stdReq.ResponseModel, stdReq.FinalPrompt, stdReq.Thinking, stdReq.Search, stdReq.ToolNames, stdReq.ToolsRaw, historySession)
+	h.handleNonStreamWithRetry(w, r.Context(), a, resp, payload, pow, sessionID, stdReq.ResponseModel, stdReq.PromptTokenText, refFileTokens, stdReq.Thinking, stdReq.Search, stdReq.ToolNames, stdReq.ToolsRaw, historySession)
 }

 func (h *Handler) autoDeleteRemoteSession(ctx context.Context, a *auth.RequestAuth, sessionID string) {
@@ -148,7 +149,7 @@ func (h *Handler) autoDeleteRemoteSession(ctx context.Context, a *auth.RequestAu
 	}
 }

-func (h *Handler) handleNonStream(w http.ResponseWriter, resp *http.Response, completionID, model, finalPrompt string, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, historySession *chatHistorySession) {
+func (h *Handler) handleNonStream(w http.ResponseWriter, resp *http.Response, completionID, model, finalPrompt string, refFileTokens int, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, historySession *chatHistorySession) {
 	if resp.StatusCode != http.StatusOK {
 		defer func() { _ = resp.Body.Close() }()
 		body, _ := io.ReadAll(resp.Body)
@@ -162,12 +163,11 @@ func (h *Handler) handleNonStream(w http.ResponseWriter, resp *http.Response, co

 	stripReferenceMarkers := h.compatStripReferenceMarkers()
 	finalThinking := cleanVisibleOutput(result.Thinking, stripReferenceMarkers)
-	finalToolDetectionThinking := cleanVisibleOutput(result.ToolDetectionThinking, stripReferenceMarkers)
 	finalText := cleanVisibleOutput(result.Text, stripReferenceMarkers)
 	if searchEnabled {
 		finalText = replaceCitationMarkersWithLinks(finalText, result.CitationLinks)
 	}
-	detected := detectAssistantToolCalls(finalText, finalThinking, finalToolDetectionThinking, toolNames)
+	detected := detectAssistantToolCalls(result.Text, finalText, result.Thinking, result.ToolDetectionThinking, toolNames)
 	if shouldWriteUpstreamEmptyOutputError(finalText) && len(detected.Calls) == 0 {
 		status, message, code := upstreamEmptyOutputDetail(result.ContentFilter, finalText, finalThinking)
 		if historySession != nil {
@@ -177,6 +177,9 @@ func (h *Handler) handleNonStream(w http.ResponseWriter, resp *http.Response, co
 		return
 	}
 	respBody := openaifmt.BuildChatCompletionWithToolCalls(completionID, model, finalPrompt, finalThinking, finalText, detected.Calls, toolsRaw)
+	if refFileTokens > 0 {
+		addRefFileTokensToUsage(respBody, refFileTokens)
+	}
 	finishReason := "stop"
 	if choices, ok := respBody["choices"].([]map[string]any); ok && len(choices) > 0 {
 		if fr, _ := choices[0]["finish_reason"].(string); strings.TrimSpace(fr) != "" {
@@ -184,12 +187,12 @@ func (h *Handler) handleNonStream(w http.ResponseWriter, resp *http.Response, co
 		}
 	}
 	if historySession != nil {
-		historySession.success(http.StatusOK, finalThinking, finalText, finishReason, openaifmt.BuildChatUsage(finalPrompt, finalThinking, finalText))
+		historySession.success(http.StatusOK, finalThinking, finalText, finishReason, openaifmt.BuildChatUsageForModel(model, finalPrompt, finalThinking, finalText, refFileTokens))
 	}
 	writeJSON(w, http.StatusOK, respBody)
 }

-func (h *Handler) handleStream(w http.ResponseWriter, r *http.Request, resp *http.Response, completionID, model, finalPrompt string, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, historySession *chatHistorySession) {
+func (h *Handler) handleStream(w http.ResponseWriter, r *http.Request, resp *http.Response, completionID, model, finalPrompt string, refFileTokens int, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, historySession *chatHistorySession) {
 	defer func() { _ = resp.Body.Close() }()
 	if resp.StatusCode != http.StatusOK {
 		body, _ := io.ReadAll(resp.Body)
@@ -234,6 +237,7 @@ func (h *Handler) handleStream(w http.ResponseWriter, r *http.Request, resp *htt
 		bufferToolContent,
 		emitEarlyToolDeltas,
 	)
+	streamRuntime.refFileTokens = refFileTokens

 	streamengine.ConsumeSSE(streamengine.ConsumeConfig{
 		Context:             r.Context(),
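Review note on `refFileTokens`: the diff threads this count through every chat handler and applies it to the non-stream body via `addRefFileTokensToUsage`, whose implementation is not part of this section. The sketch below shows one plausible behavior, assuming the referenced-file tokens are added to `prompt_tokens` and `total_tokens` of an OpenAI-style `usage` object held as `map[string]any` with `int` values. `addRefFileTokens` is a hypothetical name, and the real helper may differ.

```go
package main

import "fmt"

// addRefFileTokens is a hypothetical stand-in for addRefFileTokensToUsage.
// Assumption: tokens from referenced files count as prompt tokens, so they
// raise prompt_tokens and total_tokens and leave completion_tokens alone.
func addRefFileTokens(body map[string]any, refFileTokens int) {
	if refFileTokens <= 0 {
		return
	}
	usage, ok := body["usage"].(map[string]any)
	if !ok {
		return // no usage object to adjust
	}
	// A missing or non-int value falls back to zero.
	prompt, _ := usage["prompt_tokens"].(int)
	total, _ := usage["total_tokens"].(int)
	usage["prompt_tokens"] = prompt + refFileTokens
	usage["total_tokens"] = total + refFileTokens
}

func main() {
	body := map[string]any{
		"usage": map[string]any{
			"prompt_tokens":     12,
			"completion_tokens": 5,
			"total_tokens":      17,
		},
	}
	addRefFileTokens(body, 30)
	usage := body["usage"].(map[string]any)
	fmt.Println(usage["prompt_tokens"], usage["completion_tokens"], usage["total_tokens"])
	// prints: 42 5 47
}
```

Under this assumption a zero count leaves the body unchanged, which would be consistent with the tests in this diff passing `0` for the new parameter.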
--- a/internal/httpapi/openai/chat/handler_toolcall_test.go
+++ b/internal/httpapi/openai/chat/handler_toolcall_test.go
@@ -1,6 +1,7 @@
 package chat

 import (
+	"context"
 	"encoding/json"
 	"io"
 	"net/http"
@@ -93,7 +94,7 @@ func TestHandleNonStreamReturns429WhenUpstreamOutputEmpty(t *testing.T) {
 	)
 	rec := httptest.NewRecorder()

-	h.handleNonStream(rec, resp, "cid-empty", "deepseek-v4-flash", "prompt", false, false, nil, nil, nil)
+	h.handleNonStream(rec, resp, "cid-empty", "deepseek-v4-flash", "prompt", 0, false, false, nil, nil, nil)
 	if rec.Code != http.StatusTooManyRequests {
 		t.Fatalf("expected status 429 for empty upstream output, got %d body=%s", rec.Code, rec.Body.String())
 	}
@@ -112,7 +113,7 @@ func TestHandleNonStreamReturnsContentFilterErrorWhenUpstreamFilteredWithoutOutp
 	)
 	rec := httptest.NewRecorder()

-	h.handleNonStream(rec, resp, "cid-empty-filtered", "deepseek-v4-flash", "prompt", false, false, nil, nil, nil)
+	h.handleNonStream(rec, resp, "cid-empty-filtered", "deepseek-v4-flash", "prompt", 0, false, false, nil, nil, nil)
 	if rec.Code != http.StatusBadRequest {
 		t.Fatalf("expected status 400 for filtered upstream output, got %d body=%s", rec.Code, rec.Body.String())
 	}
@@ -131,7 +132,7 @@ func TestHandleNonStreamReturns429WhenUpstreamHasOnlyThinking(t *testing.T) {
 	)
 	rec := httptest.NewRecorder()

-	h.handleNonStream(rec, resp, "cid-thinking-only", "deepseek-v4-pro", "prompt", true, false, nil, nil, nil)
+	h.handleNonStream(rec, resp, "cid-thinking-only", "deepseek-v4-pro", "prompt", 0, true, false, nil, nil, nil)
 	if rec.Code != http.StatusTooManyRequests {
 		t.Fatalf("expected status 429 for thinking-only upstream output, got %d body=%s", rec.Code, rec.Body.String())
 	}
@@ -150,7 +151,7 @@ func TestHandleNonStreamPromotesThinkingToolCallsWhenTextEmpty(t *testing.T) {
 	)
 	rec := httptest.NewRecorder()

-	h.handleNonStream(rec, resp, "cid-thinking-tool", "deepseek-v4-pro", "prompt", true, false, []string{"search"}, nil, nil)
+	h.handleNonStream(rec, resp, "cid-thinking-tool", "deepseek-v4-pro", "prompt", 0, true, false, []string{"search"}, nil, nil)
 	if rec.Code != http.StatusOK {
 		t.Fatalf("expected 200 for thinking tool calls, got %d body=%s", rec.Code, rec.Body.String())
 	}
@@ -181,7 +182,7 @@ func TestHandleNonStreamPromotesHiddenThinkingDSMLToolCallsWhenTextEmpty(t *test
 	)
 	rec := httptest.NewRecorder()

-	h.handleNonStream(rec, resp, "cid-hidden-thinking-tool", "deepseek-v4-pro", "prompt", false, false, []string{"search"}, nil, nil)
+	h.handleNonStream(rec, resp, "cid-hidden-thinking-tool", "deepseek-v4-pro", "prompt", 0, false, false, []string{"search"}, nil, nil)
 	if rec.Code != http.StatusOK {
 		t.Fatalf("expected 200 for hidden thinking tool calls, got %d body=%s", rec.Code, rec.Body.String())
 	}
@@ -211,7 +212,7 @@ func TestHandleStreamToolsPlainTextStreamsBeforeFinish(t *testing.T) {
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil)

-	h.handleStream(rec, req, resp, "cid6", "deepseek-v4-flash", "prompt", false, false, []string{"search"}, nil, nil)
+	h.handleStream(rec, req, resp, "cid6", "deepseek-v4-flash", "prompt", 0, false, false, []string{"search"}, nil, nil)

 	frames, done := parseSSEDataFrames(t, rec.Body.String())
 	if !done {
@@ -239,6 +240,118 @@ func TestHandleStreamToolsPlainTextStreamsBeforeFinish(t *testing.T) {
 	}
 }

+func TestHandleStreamThinkingDisabledDoesNotLeakHiddenFragmentContinuations(t *testing.T) {
+	h := &Handler{}
+	resp := makeSSEHTTPResponse(
+		`data: {"p":"response/fragments","o":"APPEND","v":[{"type":"THINK","content":"我们"}]}`,
+		`data: {"p":"response/fragments/-1/content","v":"被"}`,
+		`data: {"v":"要求"}`,
+		`data: {"p":"response/fragments","o":"APPEND","v":[{"type":"RESPONSE","content":"答"}]}`,
+		`data: {"p":"response/fragments/-1/content","v":"案"}`,
+		`data: [DONE]`,
+	)
+	rec := httptest.NewRecorder()
+	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil)
+
+	h.handleStream(rec, req, resp, "cid-hidden-fragment", "deepseek-v4-flash", "prompt", 0, false, false, nil, nil, nil)
+
+	frames, done := parseSSEDataFrames(t, rec.Body.String())
+	if !done {
+		t.Fatalf("expected [DONE], body=%s", rec.Body.String())
+	}
+	content := strings.Builder{}
+	for _, frame := range frames {
+		choices, _ := frame["choices"].([]any)
+		for _, item := range choices {
+			choice, _ := item.(map[string]any)
+			delta, _ := choice["delta"].(map[string]any)
+			if c, ok := delta["content"].(string); ok {
+				content.WriteString(c)
+			}
+		}
+	}
+	if got := content.String(); got != "答案" {
+		t.Fatalf("expected only visible response text, got %q body=%s", got, rec.Body.String())
+	}
+}
+
+func TestHandleStreamEmitsSingleChoiceFramesForMultipleParsedParts(t *testing.T) {
+	h := &Handler{}
+	resp := makeSSEHTTPResponse(
+		`data: {"p":"response/fragments","o":"APPEND","v":[{"type":"THINK","content":"我们"},{"type":"THINK","content":"被"},{"type":"THINK","content":"要求"},{"type":"RESPONSE","content":"答"},{"type":"RESPONSE","content":"案"}]}`,
+		`data: [DONE]`,
+	)
+	rec := httptest.NewRecorder()
+	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil)
+
+	h.handleStream(rec, req, resp, "cid-multi-parts", "deepseek-v4-pro", "prompt", 0, true, false, nil, nil, nil)
+
+	frames, done := parseSSEDataFrames(t, rec.Body.String())
+	if !done {
+		t.Fatalf("expected [DONE], body=%s", rec.Body.String())
+	}
+	var reasoning, content strings.Builder
+	for _, frame := range frames {
+		choices, _ := frame["choices"].([]any)
+		if len(choices) != 1 {
+			t.Fatalf("expected exactly one choice per stream frame, got %d frame=%#v body=%s", len(choices), frame, rec.Body.String())
+		}
+		choice, _ := choices[0].(map[string]any)
+		delta, _ := choice["delta"].(map[string]any)
+		reasoning.WriteString(asString(delta["reasoning_content"]))
+		content.WriteString(asString(delta["content"]))
+	}
+	if got := reasoning.String(); got != "我们被要求" {
+		t.Fatalf("first-choice-only client would miss reasoning tokens: got %q body=%s", got, rec.Body.String())
+	}
+	if got := content.String(); got != "答案" {
+		t.Fatalf("first-choice-only client would miss content tokens: got %q body=%s", got, rec.Body.String())
+	}
+}
+
+func TestHandleStreamCoalescesSmallContentDeltas(t *testing.T) {
+	h := &Handler{}
+	lines := make([]string, 0, 101)
+	for i := 0; i < 100; i++ {
+		b, _ := json.Marshal(map[string]any{
+			"p": "response/content",
+			"v": "字",
+		})
+		lines = append(lines, "data: "+string(b))
+	}
+	lines = append(lines, "data: [DONE]")
+	resp := makeSSEHTTPResponse(lines...)
+	rec := httptest.NewRecorder()
+	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil)
+
+	h.handleStream(rec, req, resp, "cid-coalesce", "deepseek-v4-flash", "prompt", 0, false, false, nil, nil, nil)
+
+	frames, done := parseSSEDataFrames(t, rec.Body.String())
+	if !done {
+		t.Fatalf("expected [DONE], body=%s", rec.Body.String())
+	}
+	var content strings.Builder
+	contentDeltaFrames := 0
+	for _, frame := range frames {
+		choices, _ := frame["choices"].([]any)
+		if len(choices) != 1 {
+			t.Fatalf("expected exactly one choice per stream frame, got %d frame=%#v body=%s", len(choices), frame, rec.Body.String())
+		}
+		choice, _ := choices[0].(map[string]any)
+		delta, _ := choice["delta"].(map[string]any)
+		if c, ok := delta["content"].(string); ok {
+			contentDeltaFrames++
+			content.WriteString(c)
+		}
+	}
+	if got, want := content.String(), strings.Repeat("字", 100); got != want {
+		t.Fatalf("coalesced stream content mismatch: got %q want %q body=%s", got, want, rec.Body.String())
+	}
+	if contentDeltaFrames >= 100 {
+		t.Fatalf("expected coalescing to reduce 100 tiny content frames, got %d body=%s", contentDeltaFrames, rec.Body.String())
+	}
+}
+
 func TestHandleStreamIncompleteCapturedToolJSONFlushesAsTextOnFinalize(t *testing.T) {
 	h := &Handler{}
 	resp := makeSSEHTTPResponse(
@@ -248,7 +361,7 @@ func TestHandleStreamIncompleteCapturedToolJSONFlushesAsTextOnFinalize(t *testin
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil)

-	h.handleStream(rec, req, resp, "cid10", "deepseek-v4-flash", "prompt", false, false, []string{"search"}, nil, nil)
+	h.handleStream(rec, req, resp, "cid10", "deepseek-v4-flash", "prompt", 0, false, false, []string{"search"}, nil, nil)

 	frames, done := parseSSEDataFrames(t, rec.Body.String())
 	if !done {
@@ -282,7 +395,7 @@ func TestHandleStreamPromotesThinkingToolCallsOnFinalizeWithoutMidstreamIntercep
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil)

-	h.handleStream(rec, req, resp, "cid-thinking-stream", "deepseek-v4-pro", "prompt", true, false, []string{"search"}, nil, nil)
+	h.handleStream(rec, req, resp, "cid-thinking-stream", "deepseek-v4-pro", "prompt", 0, true, false, []string{"search"}, nil, nil)

 	frames, done := parseSSEDataFrames(t, rec.Body.String())
 	if !done {
@@ -291,20 +404,16 @@ func TestHandleStreamPromotesThinkingToolCallsOnFinalizeWithoutMidstreamIntercep
 	if !streamHasToolCallsDelta(frames) {
 		t.Fatalf("expected tool_calls delta from finalize fallback, body=%s", rec.Body.String())
 	}
-	reasoningSeen := false
 	for _, frame := range frames {
 		choices, _ := frame["choices"].([]any)
 		for _, item := range choices {
 			choice, _ := item.(map[string]any)
 			delta, _ := choice["delta"].(map[string]any)
 			if asString(delta["reasoning_content"]) != "" {
-				reasoningSeen = true
+				t.Fatalf("did not expect leaked reasoning_content markup, body=%s", rec.Body.String())
 			}
 		}
 	}
-	if !reasoningSeen {
-		t.Fatalf("expected reasoning_content to stream before finalize fallback, body=%s", rec.Body.String())
-	}
 	if streamFinishReason(frames) != "tool_calls" {
 		t.Fatalf("expected finish_reason=tool_calls, body=%s", rec.Body.String())
 	}
@@ -319,7 +428,7 @@ func TestHandleStreamPromotesHiddenThinkingDSMLToolCallsOnFinalize(t *testing.T)
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil)

-	h.handleStream(rec, req, resp, "cid-hidden-thinking-stream", "deepseek-v4-pro", "prompt", false, false, []string{"search"}, nil, nil)
+	h.handleStream(rec, req, resp, "cid-hidden-thinking-stream", "deepseek-v4-pro", "prompt", 0, false, false, []string{"search"}, nil, nil)

 	frames, done := parseSSEDataFrames(t, rec.Body.String())
 	if !done {
@@ -353,7 +462,7 @@ func TestHandleStreamEmitsDistinctToolCallIDsAcrossSeparateToolBlocks(t *testing
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil)

-	h.handleStream(rec, req, resp, "cid-multi", "deepseek-v4-flash", "prompt", false, false, []string{"read_file", "search"}, nil, nil)
+	h.handleStream(rec, req, resp, "cid-multi", "deepseek-v4-flash", "prompt", 0, false, false, []string{"read_file", "search"}, nil, nil)

 	frames, done := parseSSEDataFrames(t, rec.Body.String())
 	if !done {
@@ -419,7 +528,7 @@ func TestHandleStreamCoercesSchemaDeclaredStringArgumentsOnFinalize(t *testing.T
 		},
 	}

-	h.handleStream(rec, req, resp, "cid-string-protect", "deepseek-v4-flash", "prompt", false, false, []string{"Write"}, toolsRaw, nil)
+	h.handleStream(rec, req, resp, "cid-string-protect", "deepseek-v4-flash", "prompt", 0, false, false, []string{"Write"}, toolsRaw, nil)

 	frames, done := parseSSEDataFrames(t, rec.Body.String())
 	if !done {
@@ -451,3 +560,45 @@ func TestHandleStreamCoercesSchemaDeclaredStringArgumentsOnFinalize(t *testing.T
 	}
 	t.Fatalf("expected at least one streamed tool call delta, body=%s", rec.Body.String())
 }
+
+func TestHandleNonStreamWithRetryIncludesRefFileTokensInUsage(t *testing.T) {
+	h := &Handler{}
+
+	run := func(refFileTokens int) map[string]any {
+		resp := makeSSEHTTPResponse(
+			`data: {"p":"response/content","v":"hello world"}`,
+			`data: [DONE]`,
+		)
+		rec := httptest.NewRecorder()
+		h.handleNonStreamWithRetry(rec, context.Background(), nil, resp, nil, "", "cid-ref", "deepseek-v4-flash", "prompt", refFileTokens, false, false, nil, nil, nil)
+		if rec.Code != http.StatusOK {
+			t.Fatalf("expected 200, got %d body=%s", rec.Code, rec.Body.String())
+		}
+		return decodeJSONBody(t, rec.Body.String())
+	}
+
+	base := run(0)
+	withRef := run(7)
+
+	baseUsage, _ := base["usage"].(map[string]any)
+	refUsage, _ := withRef["usage"].(map[string]any)
+	if baseUsage == nil || refUsage == nil {
+		t.Fatalf("expected usage objects, base=%#v ref=%#v", base["usage"], withRef["usage"])
+	}
+
+	getInt := func(m map[string]any, key string) int {
+		t.Helper()
+		v, ok := m[key].(float64)
+		if !ok {
+			t.Fatalf("expected numeric %s, got %#v", key, m[key])
+		}
+		return int(v)
+	}
+
+	if got := getInt(refUsage, "prompt_tokens") - getInt(baseUsage, "prompt_tokens"); got != 7 {
+		t.Fatalf("expected prompt_tokens delta 7, got %d", got)
+	}
+	if got := getInt(refUsage, "total_tokens") - getInt(baseUsage, "total_tokens"); got != 7 {
+		t.Fatalf("expected total_tokens delta 7, got %d", got)
+	}
+}
--- a/internal/httpapi/openai/chat/ref_file_tokens.go
+++ b/internal/httpapi/openai/chat/ref_file_tokens.go
@@ -0,0 +1,26 @@
+package chat
+
+// addRefFileTokensToUsage adds inline-uploaded file token estimates to an existing
+// usage map inside a response object. This keeps the token accounting aware of file
+// content that the upstream model processes but that is not part of the prompt text.
+func addRefFileTokensToUsage(obj map[string]any, refFileTokens int) {
+	if refFileTokens <= 0 || obj == nil {
+		return
+	}
+	usage, ok := obj["usage"].(map[string]any)
+	if !ok || usage == nil {
+		return
+	}
+	for _, key := range []string{"input_tokens", "prompt_tokens"} {
+		if v, ok := usage[key]; ok {
+			if n, ok := v.(int); ok {
+				usage[key] = n + refFileTokens
+			}
+		}
+	}
+	if v, ok := usage["total_tokens"]; ok {
+		if n, ok := v.(int); ok {
+			usage["total_tokens"] = n + refFileTokens
+		}
+	}
+}
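
Review note on `addRefFileTokensToUsage`: the helper only adjusts counters whose dynamic type is Go `int`. The sketch below is a standalone illustration, not the project's code. `addRefFileTokens`, `newResponse`, and `roundTrip` are hypothetical names introduced here. It shows the helper's behavior on a natively built usage map, and shows that a usage map decoded from JSON (where numbers arrive as `float64`) is left unchanged by the same type assertion. Whether the real call sites ever pass a decoded map is not visible in this diff, so treat the second case as something to verify rather than a confirmed bug.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// addRefFileTokens mirrors the logic of the helper in the diff above
// (hypothetical standalone copy): counters are bumped only when stored as int.
func addRefFileTokens(obj map[string]any, refFileTokens int) {
	if refFileTokens <= 0 || obj == nil {
		return
	}
	usage, ok := obj["usage"].(map[string]any)
	if !ok || usage == nil {
		return
	}
	for _, key := range []string{"input_tokens", "prompt_tokens", "total_tokens"} {
		if n, ok := usage[key].(int); ok {
			usage[key] = n + refFileTokens
		}
	}
}

// newResponse builds a response object with int counters, as Go code
// constructing the map directly would produce. Hypothetical helper.
func newResponse(prompt, completion int) map[string]any {
	return map[string]any{
		"usage": map[string]any{
			"prompt_tokens":     prompt,
			"completion_tokens": completion,
			"total_tokens":      prompt + completion,
		},
	}
}

// roundTrip simulates a response that was serialized and decoded again;
// encoding/json decodes numbers into float64 when the target is `any`.
func roundTrip(obj map[string]any) map[string]any {
	b, err := json.Marshal(obj)
	if err != nil {
		panic(err)
	}
	var out map[string]any
	if err := json.Unmarshal(b, &out); err != nil {
		panic(err)
	}
	return out
}

func main() {
	native := newResponse(10, 5)
	addRefFileTokens(native, 7)
	fmt.Println("native: ", native["usage"])

	decoded := roundTrip(newResponse(10, 5))
	addRefFileTokens(decoded, 7)
	fmt.Println("decoded:", decoded["usage"])
}
```

If decoded maps can reach the helper, a type switch over `int`, `int64`, and `float64` would be one way to cover them. That is a design suggestion under the assumption above, not something the diff requires.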
--- a/internal/httpapi/openai/citation_links_test.go
+++ b/internal/httpapi/openai/citation_links_test.go
@@ -54,3 +54,31 @@ func TestReplaceCitationMarkersWithLinksSupportsReferenceZeroBased(t *testing.T)
 		t.Fatalf("expected %q, got %q", want, got)
 	}
 }
+
+func TestReplaceCitationMarkersWithLinksKeepsCitationOneBasedWithZeroBasedReference(t *testing.T) {
+	raw := "引用[citation:1]，来源[reference:0]，后续[reference:1]。"
+	links := map[int]string{
+		1: "https://example.com/first",
+		2: "https://example.com/second",
+	}
+
+	got := replaceCitationMarkersWithLinks(raw, links)
+	want := "引用[1](https://example.com/first)，来源[0](https://example.com/first)，后续[1](https://example.com/second)。"
+	if got != want {
+		t.Fatalf("expected %q, got %q", want, got)
+	}
+}
+
+func TestReplaceCitationMarkersWithLinksDetectsSpacedReferenceZeroBased(t *testing.T) {
+	raw := "来源[reference: 0] 与 [reference: 1]。"
+	links := map[int]string{
+		1: "https://example.com/first",
+		2: "https://example.com/second",
+	}
+
+	got := replaceCitationMarkersWithLinks(raw, links)
+	want := "来源[0](https://example.com/first) 与 [1](https://example.com/second)。"
+	if got != want {
+		t.Fatalf("expected %q, got %q", want, got)
+	}
+}
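
Review note on the two citation tests above: they pin down an indexing rule that is easy to misread. `[citation:N]` stays one-based, while `[reference:N]` is shifted by one when the text uses zero-based references, and a space after the colon is tolerated. The sketch below reproduces only what those two tests assert. `replaceMarkers` and `markerRe` are hypothetical names, and the detection rule (treat references as zero-based when a `[reference:0]` marker appears) is an assumption inferred from the test names, not the actual implementation of `replaceCitationMarkersWithLinks`.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// markerRe matches [citation:N] and [reference:N], allowing optional
// whitespace after the colon.
var markerRe = regexp.MustCompile(`\[(citation|reference):\s*(\d+)\]`)

// replaceMarkers is a hypothetical sketch of the behavior the tests describe.
// links is keyed by one-based index.
func replaceMarkers(raw string, links map[int]string) string {
	// Assumption: references count from zero only if a [reference:0] appears.
	zeroBased := false
	for _, m := range markerRe.FindAllStringSubmatch(raw, -1) {
		if m[1] == "reference" && m[2] == "0" {
			zeroBased = true
			break
		}
	}
	return markerRe.ReplaceAllStringFunc(raw, func(tok string) string {
		m := markerRe.FindStringSubmatch(tok)
		n, err := strconv.Atoi(m[2])
		if err != nil {
			return tok
		}
		key := n
		if m[1] == "reference" && zeroBased {
			key = n + 1 // shift the zero-based reference onto the one-based link map
		}
		url, ok := links[key]
		if !ok {
			return tok // leave unknown markers untouched
		}
		// The visible label keeps the number the model wrote.
		return fmt.Sprintf("[%d](%s)", n, url)
	})
}

func main() {
	links := map[int]string{
		1: "https://example.com/first",
		2: "https://example.com/second",
	}
	fmt.Println(replaceMarkers("see [citation:1], [reference:0], [reference: 1]", links))
}
```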
--- a/internal/httpapi/openai/file_inline_upload_test.go
+++ b/internal/httpapi/openai/file_inline_upload_test.go
@@ -216,6 +216,45 @@ func TestChatCompletionsInlineUploadFailureReturnsBadRequest(t *testing.T) {
 	}
 }

+func TestChatCompletionsInlineUploadLimitReturnsBadRequest(t *testing.T) {
+	ds := &inlineUploadDSStub{}
+	h := &openAITestSurface{Store: mockOpenAIConfig{wideInput: true}, Auth: streamStatusAuthStub{}, DS: ds}
+	content := []any{map[string]any{"type": "input_text", "text": "hi"}}
+	for i := 0; i < 51; i++ {
+		content = append(content, map[string]any{
+			"type":      "image_url",
+			"image_url": map[string]any{"url": "data:image/png;base64,QUJDRA=="},
+		})
+	}
+	body, err := json.Marshal(map[string]any{
+		"model": "deepseek-v4-flash",
+		"messages": []any{map[string]any{
+			"role":    "user",
+			"content": content,
+		}},
+		"stream": false,
+	})
+	if err != nil {
+		t.Fatalf("marshal request: %v", err)
+	}
+	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", strings.NewReader(string(body)))
+	req.Header.Set("Authorization", "Bearer direct-token")
+	req.Header.Set("Content-Type", "application/json")
+	rec := httptest.NewRecorder()
+
+	h.ChatCompletions(rec, req)
+
+	if rec.Code != http.StatusBadRequest {
+		t.Fatalf("expected 400, got %d body=%s", rec.Code, rec.Body.String())
+	}
+	if !strings.Contains(rec.Body.String(), "exceeded maximum of 50 inline files per request") {
+		t.Fatalf("expected inline file limit error, got body=%s", rec.Body.String())
+	}
+	if ds.completionReq != nil {
+		t.Fatalf("did not expect completion call after inline file limit error")
+	}
+}
+
 func TestResponsesInlineUploadFailureReturnsInternalServerError(t *testing.T) {
 	ds := &inlineUploadDSStub{uploadErr: errors.New("boom")}
 	h := &openAITestSurface{Store: mockOpenAIConfig{wideInput: true}, Auth: streamStatusAuthStub{}, DS: ds}
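
Review note on the inline-file limit: the new test expects HTTP 400 when a request carries more than 50 inline files, and the `tryUploadBlock` change in the next file achieves this by wrapping the limit error in `inlineFileUploadError` with `http.StatusBadRequest`. The sketch below shows the general pattern of a status-carrying error resolved with `errors.As`. `uploadError`, `checkLimit`, and `statusFor` are hypothetical names, the `Unwrap` method is an assumption, and how the real handler maps the error to a response is not shown in this diff.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// maxInlineFiles matches the limit quoted in the test's expected message.
const maxInlineFiles = 50

// uploadError is a hypothetical stand-in for inlineFileUploadError:
// an error that carries the HTTP status the handler should return.
type uploadError struct {
	status  int
	message string
	err     error
}

func (e *uploadError) Error() string { return e.message }
func (e *uploadError) Unwrap() error { return e.err }

// checkLimit rejects a further upload once the per-request count is reached.
func checkLimit(uploadCount int) error {
	if uploadCount >= maxInlineFiles {
		err := fmt.Errorf("exceeded maximum of %d inline files per request", maxInlineFiles)
		return &uploadError{status: http.StatusBadRequest, message: err.Error(), err: err}
	}
	return nil
}

// statusFor picks the response status: the typed error's own status when
// present, 500 for any other error, 200 when there is no error.
func statusFor(err error) int {
	if err == nil {
		return http.StatusOK
	}
	var ue *uploadError
	if errors.As(err, &ue) {
		return ue.status
	}
	return http.StatusInternalServerError
}

func main() {
	fmt.Println(statusFor(checkLimit(49)))         // under the limit
	fmt.Println(statusFor(checkLimit(50)))         // limit reached, typed error
	fmt.Println(statusFor(errors.New("untyped")))  // plain error falls back to 500
}
```

Before this change the limit error was a plain `fmt.Errorf` value, so a mapping like `statusFor` would have had no status to read from it. Wrapping it in the typed error makes the client-side cause explicit.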
--- a/internal/httpapi/openai/files/file_inline_upload.go
+++ b/internal/httpapi/openai/files/file_inline_upload.go
@@ -39,11 +39,12 @@ func (e *inlineFileUploadError) Error() string {
 }

 type inlineUploadState struct {
-	ctx          context.Context
-	handler      *Handler
-	auth         *auth.RequestAuth
-	uploadedByID map[string]string
-	uploadCount  int
+	ctx             context.Context
+	handler         *Handler
+	auth            *auth.RequestAuth
+	uploadedByID    map[string]string
+	uploadCount     int
+	inlineFileBytes int
 }

 type inlineDecodedFile struct {
@@ -75,6 +76,9 @@ func (h *Handler) PreprocessInlineFileInputs(ctx context.Context, a *auth.Reques
 	if refIDs := promptcompat.CollectOpenAIRefFileIDs(req); len(refIDs) > 0 {
 		req["ref_file_ids"] = stringsToAnySlice(refIDs)
 	}
+	if state.inlineFileBytes > 0 {
+		req["_inline_file_bytes"] = state.inlineFileBytes
+	}
 	return nil
 }

@@ -135,13 +139,15 @@ func (s *inlineUploadState) tryUploadBlock(block map[string]any) (map[string]any
 		return nil, false, nil
 	}
 	if s.uploadCount >= maxInlineFilesPerRequest {
-		return nil, true, fmt.Errorf("exceeded maximum of %d inline files per request", maxInlineFilesPerRequest)
+		err := fmt.Errorf("exceeded maximum of %d inline files per request", maxInlineFilesPerRequest)
+		return nil, true, &inlineFileUploadError{status: http.StatusBadRequest, message: err.Error(), err: err}
 	}
 	fileID, err := s.uploadInlineFile(decoded)
 	if err != nil {
 		return nil, true, &inlineFileUploadError{status: http.StatusInternalServerError, message: "Failed to upload inline file.", err: err}
 	}
 	s.uploadCount++
+	s.inlineFileBytes += len(decoded.Data)
 	replacement := map[string]any{
 		"type":    decoded.ReplacementType,
 		"file_id": fileID,
--- a/internal/httpapi/openai/history/current_input_file.go
+++ b/internal/httpapi/openai/history/current_input_file.go
@@ -13,7 +13,7 @@ import (
 )

 const (
-	currentInputFilename    = "IGNORE.txt"
+	currentInputFilename    = promptcompat.CurrentInputContextFilename
 	currentInputContentType = "text/plain; charset=utf-8"
 	currentInputPurpose     = "assistants"
 )
@@ -35,7 +35,6 @@ func (s Service) ApplyCurrentInputFile(ctx context.Context, a *auth.RequestAuth,
 	if strings.TrimSpace(fileText) == "" {
 		return stdReq, errors.New("current user input file produced empty transcript")
 	}
-
 	result, err := s.DS.UploadFile(ctx, a, dsclient.UploadFileRequest{
 		Filename:    currentInputFilename,
 		ContentType: currentInputContentType,
@@ -58,9 +57,13 @@ func (s Service) ApplyCurrentInputFile(ctx context.Context, a *auth.RequestAuth,
 	}

 	stdReq.Messages = messages
+	stdReq.HistoryText = fileText
 	stdReq.CurrentInputFileApplied = true
 	stdReq.RefFileIDs = prependUniqueRefFileID(stdReq.RefFileIDs, fileID)
 	stdReq.FinalPrompt, stdReq.ToolNames = promptcompat.BuildOpenAIPrompt(messages, stdReq.ToolsRaw, "", stdReq.ToolChoice, stdReq.Thinking)
+	// Token accounting must reflect the actual downstream context:
+	// the uploaded history.txt file content + the neutral live prompt.
+	stdReq.PromptTokenText = fileText + "\n" + stdReq.FinalPrompt
 	return stdReq, nil
 }

--- a/internal/httpapi/openai/history_split_test.go
+++ b/internal/httpapi/openai/history_split_test.go
@@ -14,6 +14,7 @@ import (
 	"ds2api/internal/auth"
 	dsclient "ds2api/internal/deepseek/client"
 	"ds2api/internal/promptcompat"
+	"ds2api/internal/util"
 )

 func historySplitTestMessages() []any {
@@ -64,8 +65,8 @@ func TestBuildOpenAICurrentInputContextTranscriptUsesInjectedFileWrapper(t *test
 	_, historyMessages := splitOpenAIHistoryMessages(historySplitTestMessages(), 1)
 	transcript := buildOpenAICurrentInputContextTranscript(historyMessages)

-	if !strings.HasPrefix(transcript, "[file content end]\n\n") {
-		t.Fatalf("expected injected file wrapper prefix, got %q", transcript)
+	if strings.Contains(transcript, "[file content end]") || strings.Contains(transcript, "[file content begin]") || strings.Contains(transcript, "[file name]:") {
+		t.Fatalf("expected plain transcript without file wrapper tags, got %q", transcript)
 	}
 	if !strings.Contains(transcript, "<｜begin▁of▁sentence｜>") {
 		t.Fatalf("expected serialized conversation markers, got %q", transcript)
@@ -79,9 +80,7 @@ func TestBuildOpenAICurrentInputContextTranscriptUsesInjectedFileWrapper(t *test
 	if !strings.Contains(transcript, "<|DSML|tool_calls>") {
 		t.Fatalf("expected tool calls preserved, got %q", transcript)
 	}
-	if !strings.HasSuffix(transcript, "\n[file name]: IGNORE\n[file content begin]\n") {
-		t.Fatalf("expected injected file wrapper suffix, got %q", transcript)
-	}
+
 }

 func TestSplitOpenAIHistoryMessagesUsesLatestUserTurn(t *testing.T) {
@@ -274,12 +273,12 @@ func TestApplyCurrentInputFileUploadsFirstTurnWithInjectedWrapper(t *testing.T)
 		t.Fatalf("expected 1 current input upload, got %d", len(ds.uploadCalls))
 	}
 	upload := ds.uploadCalls[0]
-	if upload.Filename != "IGNORE.txt" {
+	if upload.Filename != "history.txt" {
 		t.Fatalf("unexpected upload filename: %q", upload.Filename)
 	}
 	uploadedText := string(upload.Data)
-	if !strings.HasPrefix(uploadedText, "[file content end]\n\n") {
-		t.Fatalf("expected injected file wrapper prefix, got %q", uploadedText)
+	if strings.Contains(uploadedText, "[file content end]") || strings.Contains(uploadedText, "[file content begin]") || strings.Contains(uploadedText, "[file name]:") {
+		t.Fatalf("expected uploaded transcript without file wrapper tags, got %q", uploadedText)
 	}
 	if !strings.Contains(uploadedText, "<｜begin▁of▁sentence｜><｜User｜>first turn content that is long enough") {
 		t.Fatalf("expected serialized current user turn markers, got %q", uploadedText)
@@ -287,13 +286,11 @@ func TestApplyCurrentInputFileUploadsFirstTurnWithInjectedWrapper(t *testing.T)
 	if !strings.Contains(uploadedText, promptcompat.ThinkingInjectionMarker) {
 		t.Fatalf("expected thinking injection in current input file, got %q", uploadedText)
 	}
-	if !strings.HasSuffix(uploadedText, "\n[file name]: IGNORE\n[file content begin]\n") {
-		t.Fatalf("expected injected file wrapper suffix, got %q", uploadedText)
-	}
+
 	if strings.Contains(out.FinalPrompt, "first turn content that is long enough") {
 		t.Fatalf("expected current input text to be replaced in live prompt, got %s", out.FinalPrompt)
 	}
-	if strings.Contains(out.FinalPrompt, "CURRENT_USER_INPUT.txt") || strings.Contains(out.FinalPrompt, "IGNORE.txt") || strings.Contains(out.FinalPrompt, "Read that file") {
+	if strings.Contains(out.FinalPrompt, "CURRENT_USER_INPUT.txt") || strings.Contains(out.FinalPrompt, "history.txt") || strings.Contains(out.FinalPrompt, "Read that file") {
 		t.Fatalf("expected live prompt not to instruct file reads, got %s", out.FinalPrompt)
 	}
 	if !strings.Contains(out.FinalPrompt, "Answer the latest user request directly.") {
@@ -302,6 +299,52 @@ func TestApplyCurrentInputFileUploadsFirstTurnWithInjectedWrapper(t *testing.T)
 	if len(out.RefFileIDs) != 1 || out.RefFileIDs[0] != "file-inline-1" {
 		t.Fatalf("expected current input file id in ref_file_ids, got %#v", out.RefFileIDs)
 	}
+	if !strings.Contains(out.PromptTokenText, "first turn content that is long enough") {
+		t.Fatalf("expected prompt token text to preserve original full context, got %q", out.PromptTokenText)
+	}
+}
+
+func TestApplyCurrentInputFilePreservesFullContextPromptForTokenCounting(t *testing.T) {
+	ds := &inlineUploadDSStub{}
+	h := &openAITestSurface{
+		Store: mockOpenAIConfig{
+			wideInput:           true,
+			currentInputEnabled: true,
+			currentInputMin:     0,
+			thinkingInjection:   boolPtr(true),
+		},
+		DS: ds,
+	}
+	req := map[string]any{
+		"model":    "deepseek-v4-flash",
+		"messages": historySplitTestMessages(),
+	}
+	stdReq, err := promptcompat.NormalizeOpenAIChatRequest(h.Store, req, "")
+	if err != nil {
+		t.Fatalf("normalize failed: %v", err)
+	}
+
+	out, err := h.applyCurrentInputFile(context.Background(), &auth.RequestAuth{DeepSeekToken: "token"}, stdReq)
+	if err != nil {
+		t.Fatalf("apply current input file failed: %v", err)
+	}
+	if out.FinalPrompt == stdReq.FinalPrompt {
+		t.Fatalf("expected live prompt to be rewritten after current input file")
+	}
+	// PromptTokenText must include the uploaded file content (which contains the full context)
+	// plus the neutral live prompt — reflecting the actual downstream token cost.
+	if !strings.Contains(out.PromptTokenText, "first user turn") || !strings.Contains(out.PromptTokenText, "latest user turn") {
+		t.Fatalf("expected prompt token text to contain file context with full conversation, got %q", out.PromptTokenText)
+	}
+	if strings.Contains(out.PromptTokenText, "[file content end]") || strings.Contains(out.PromptTokenText, "[file name]:") {
+		t.Fatalf("expected prompt token text to use raw transcript without wrapper tags, got %q", out.PromptTokenText)
+	}
+	if !strings.Contains(out.PromptTokenText, "Answer the latest user request directly.") {
+		t.Fatalf("expected prompt token text to also include neutral live prompt, got %q", out.PromptTokenText)
+	}
+	if strings.Contains(out.FinalPrompt, "first user turn") || strings.Contains(out.FinalPrompt, "latest user turn") {
+		t.Fatalf("expected live prompt to hide original turns, got %q", out.FinalPrompt)
+	}
 }

 func TestApplyCurrentInputFileUploadsFullContextFile(t *testing.T) {
@@ -335,8 +378,8 @@ func TestApplyCurrentInputFileUploadsFullContextFile(t *testing.T) {
 		t.Fatalf("expected one current input upload, got %d", len(ds.uploadCalls))
 	}
 	upload := ds.uploadCalls[0]
-	if upload.Filename != "IGNORE.txt" {
-		t.Fatalf("expected IGNORE.txt upload, got %q", upload.Filename)
+	if upload.Filename != "history.txt" {
+		t.Fatalf("expected history.txt upload, got %q", upload.Filename)
 	}
 	uploadedText := string(upload.Data)
 	for _, want := range []string{"system instructions", "first user turn", "hidden reasoning", "tool result", "latest user turn", promptcompat.ThinkingInjectionMarker} {
@@ -344,7 +387,7 @@ func TestApplyCurrentInputFileUploadsFullContextFile(t *testing.T) {
 			t.Fatalf("expected full context file to contain %q, got %q", want, uploadedText)
 		}
 	}
-	if strings.Contains(out.FinalPrompt, "first user turn") || strings.Contains(out.FinalPrompt, "latest user turn") || strings.Contains(out.FinalPrompt, "CURRENT_USER_INPUT.txt") || strings.Contains(out.FinalPrompt, "IGNORE.txt") || strings.Contains(out.FinalPrompt, "Read that file") {
+	if strings.Contains(out.FinalPrompt, "first user turn") || strings.Contains(out.FinalPrompt, "latest user turn") || strings.Contains(out.FinalPrompt, "CURRENT_USER_INPUT.txt") || strings.Contains(out.FinalPrompt, "history.txt") || strings.Contains(out.FinalPrompt, "Read that file") {
 		t.Fatalf("expected live prompt to use only a neutral continuation instruction, got %s", out.FinalPrompt)
 	}
 	if !strings.Contains(out.FinalPrompt, "Answer the latest user request directly.") {
@@ -352,7 +395,7 @@ func TestApplyCurrentInputFileUploadsFullContextFile(t *testing.T) {
 	}
 }

-func TestApplyCurrentInputFileLeavesHistoryTextEmpty(t *testing.T) {
+func TestApplyCurrentInputFileCarriesHistoryText(t *testing.T) {
 	ds := &inlineUploadDSStub{}
 	h := &openAITestSurface{
 		Store: mockOpenAIConfig{
@@ -377,8 +420,8 @@ func TestApplyCurrentInputFileLeavesHistoryTextEmpty(t *testing.T) {
 	if len(ds.uploadCalls) != 1 {
 		t.Fatalf("expected 1 upload call, got %d", len(ds.uploadCalls))
 	}
-	if out.HistoryText != "" {
-		t.Fatalf("expected current input file flow to leave history text empty, got %q", out.HistoryText)
+	if out.HistoryText != string(ds.uploadCalls[0].Data) {
+		t.Fatalf("expected current input file flow to preserve uploaded text in history, got %q", out.HistoryText)
 	}
 }

@@ -411,15 +454,15 @@ func TestChatCompletionsCurrentInputFileUploadsContextAndKeepsNeutralPrompt(t *t
 		t.Fatalf("expected 1 upload call, got %d", len(ds.uploadCalls))
 	}
 	upload := ds.uploadCalls[0]
-	if upload.Filename != "IGNORE.txt" {
+	if upload.Filename != "history.txt" {
 		t.Fatalf("unexpected upload filename: %q", upload.Filename)
 	}
 	if upload.Purpose != "assistants" {
 		t.Fatalf("unexpected purpose: %q", upload.Purpose)
 	}
 	historyText := string(upload.Data)
-	if !strings.Contains(historyText, "[file content end]") || !strings.Contains(historyText, "[file name]: IGNORE") {
-		t.Fatalf("expected injected IGNORE wrapper, got %s", historyText)
+	if strings.Contains(historyText, "[file content end]") || strings.Contains(historyText, "[file content begin]") || strings.Contains(historyText, "[file name]:") {
+		t.Fatalf("expected plain history transcript without wrapper tags, got %s", historyText)
 	}
 	if !strings.Contains(historyText, "latest user turn") {
 		t.Fatalf("expected full context to include latest turn, got %s", historyText)
@@ -438,6 +481,16 @@ func TestChatCompletionsCurrentInputFileUploadsContextAndKeepsNeutralPrompt(t *t
 	if len(refIDs) == 0 || refIDs[0] != "file-inline-1" {
 		t.Fatalf("expected uploaded current input file to be first ref_file_id, got %#v", ds.completionReq["ref_file_ids"])
 	}
+	var body map[string]any
+	if err := json.Unmarshal(rec.Body.Bytes(), &body); err != nil {
+		t.Fatalf("decode response failed: %v", err)
+	}
+	usage, _ := body["usage"].(map[string]any)
+	promptTokens := int(usage["prompt_tokens"].(float64))
+	neutralCount := util.CountPromptTokens(promptText, "deepseek-v4-flash")
+	if promptTokens <= neutralCount {
+		t.Fatalf("expected prompt_tokens to exceed neutral live prompt count (includes file context), got=%d neutral=%d", promptTokens, neutralCount)
+	}
 }

 func TestResponsesCurrentInputFileUploadsContextAndKeepsNeutralPrompt(t *testing.T) {
@@ -480,6 +533,16 @@ func TestResponsesCurrentInputFileUploadsContextAndKeepsNeutralPrompt(t *testing
 	if strings.Contains(promptText, "first user turn") || strings.Contains(promptText, "latest user turn") {
 		t.Fatalf("expected prompt to hide original turns, got %s", promptText)
 	}
+	var body map[string]any
+	if err := json.Unmarshal(rec.Body.Bytes(), &body); err != nil {
+		t.Fatalf("decode response failed: %v", err)
+	}
+	usage, _ := body["usage"].(map[string]any)
+	inputTokens := int(usage["input_tokens"].(float64))
+	neutralCount := util.CountPromptTokens(promptText, "deepseek-v4-flash")
+	if inputTokens <= neutralCount {
+		t.Fatalf("expected input_tokens to exceed neutral live prompt count (includes file context), got=%d neutral=%d", inputTokens, neutralCount)
+	}
 }

 func TestChatCompletionsCurrentInputFileMapsManagedAuthFailureTo401(t *testing.T) {
--- a/internal/httpapi/openai/leaked_output_sanitize_test.go
+++ b/internal/httpapi/openai/leaked_output_sanitize_test.go
@@ -42,6 +42,14 @@ func TestSanitizeLeakedOutputRemovesDanglingThinkBlock(t *testing.T) {
 	}
 }

+func TestSanitizeLeakedOutputRemovesCompleteDSMLToolCallWrapper(t *testing.T) {
+	raw := "前置文本\n<｜DSML｜tool_calls>\n<｜DSML｜invoke name=\"Bash\">\n<｜DSML｜parameter name=\"command\"></｜DSML｜parameter>\n</｜DSML｜invoke>\n</｜DSML｜tool_calls>\n后置文本"
+	got := sanitizeLeakedOutput(raw)
+	if got != "前置文本\n\n后置文本" {
+		t.Fatalf("unexpected sanitize result for leaked dsml wrapper: %q", got)
+	}
+}
+
 func TestSanitizeLeakedOutputRemovesAgentXMLLeaks(t *testing.T) {
 	raw := "Done.<attempt_completion><result>Some final answer</result></attempt_completion>"
 	got := sanitizeLeakedOutput(raw)
--- a/internal/httpapi/openai/responses/empty_retry_runtime.go
+++ b/internal/httpapi/openai/responses/empty_retry_runtime.go
@@ -18,6 +18,8 @@ import (
 )

 type responsesNonStreamResult struct {
+	rawThinking           string
+	rawText               string
 	thinking              string
 	toolDetectionThinking string
 	text                  string
@@ -27,11 +29,12 @@ type responsesNonStreamResult struct {
 	responseMessageID     int
 }

-func (h *Handler) handleResponsesNonStreamWithRetry(w http.ResponseWriter, ctx context.Context, a *auth.RequestAuth, resp *http.Response, payload map[string]any, pow, owner, responseID, model, finalPrompt string, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, toolChoice promptcompat.ToolChoicePolicy, traceID string) {
+func (h *Handler) handleResponsesNonStreamWithRetry(w http.ResponseWriter, ctx context.Context, a *auth.RequestAuth, resp *http.Response, payload map[string]any, pow, owner, responseID, model, finalPrompt string, refFileTokens int, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, toolChoice promptcompat.ToolChoicePolicy, traceID string) {
 	attempts := 0
 	currentResp := resp
 	usagePrompt := finalPrompt
 	accumulatedThinking := ""
+	accumulatedRawThinking := ""
 	accumulatedToolDetectionThinking := ""
 	for {
 		result, ok := h.collectResponsesNonStreamAttempt(w, currentResp, responseID, model, usagePrompt, thinkingEnabled, searchEnabled, toolNames, toolsRaw)
@@ -39,11 +42,16 @@ func (h *Handler) handleResponsesNonStreamWithRetry(w http.ResponseWriter, ctx c
 			return
 		}
 		accumulatedThinking += sse.TrimContinuationOverlap(accumulatedThinking, result.thinking)
+		accumulatedRawThinking += sse.TrimContinuationOverlap(accumulatedRawThinking, result.rawThinking)
 		accumulatedToolDetectionThinking += sse.TrimContinuationOverlap(accumulatedToolDetectionThinking, result.toolDetectionThinking)
 		result.thinking = accumulatedThinking
+		result.rawThinking = accumulatedRawThinking
 		result.toolDetectionThinking = accumulatedToolDetectionThinking
-		result.parsed = detectAssistantToolCalls(result.text, result.thinking, result.toolDetectionThinking, toolNames)
+		result.parsed = detectAssistantToolCalls(result.rawText, result.text, result.rawThinking, result.toolDetectionThinking, toolNames)
 		result.body = openaifmt.BuildResponseObjectWithToolCalls(responseID, model, usagePrompt, result.thinking, result.text, result.parsed.Calls, toolsRaw)
+		if refFileTokens > 0 {
+			addRefFileTokensToUsage(result.body, refFileTokens)
+		}

 		if !shouldRetryResponsesNonStream(result, attempts) {
 			h.finishResponsesNonStreamResult(w, result, attempts, owner, responseID, toolChoice, traceID)
@@ -63,7 +71,7 @@ func (h *Handler) handleResponsesNonStreamWithRetry(w http.ResponseWriter, ctx c
 			config.Logger.Warn("[openai_empty_retry] retry request failed", "surface", "responses", "stream", false, "retry_attempt", attempts, "error", err)
 			return
 		}
-		usagePrompt = usagePromptWithEmptyOutputRetry(finalPrompt, attempts)
+		usagePrompt = usagePromptWithEmptyOutputRetry(usagePrompt, attempts)
 		currentResp = nextResp
 	}
 }
@@ -78,16 +86,17 @@ func (h *Handler) collectResponsesNonStreamAttempt(w http.ResponseWriter, resp *
 	result := sse.CollectStream(resp, thinkingEnabled, false)
 	stripReferenceMarkers := h.compatStripReferenceMarkers()
 	sanitizedThinking := cleanVisibleOutput(result.Thinking, stripReferenceMarkers)
-	toolDetectionThinking := cleanVisibleOutput(result.ToolDetectionThinking, stripReferenceMarkers)
 	sanitizedText := cleanVisibleOutput(result.Text, stripReferenceMarkers)
 	if searchEnabled {
 		sanitizedText = replaceCitationMarkersWithLinks(sanitizedText, result.CitationLinks)
 	}
-	textParsed := detectAssistantToolCalls(sanitizedText, sanitizedThinking, toolDetectionThinking, toolNames)
+	textParsed := detectAssistantToolCalls(result.Text, sanitizedText, result.Thinking, result.ToolDetectionThinking, toolNames)
 	responseObj := openaifmt.BuildResponseObjectWithToolCalls(responseID, model, usagePrompt, sanitizedThinking, sanitizedText, textParsed.Calls, toolsRaw)
 	return responsesNonStreamResult{
+		rawThinking:           result.Thinking,
+		rawText:               result.Text,
 		thinking:              sanitizedThinking,
-		toolDetectionThinking: toolDetectionThinking,
+		toolDetectionThinking: result.ToolDetectionThinking,
 		text:                  sanitizedText,
 		contentFilter:         result.ContentFilter,
 		parsed:                textParsed,
@@ -123,8 +132,8 @@ func shouldRetryResponsesNonStream(result responsesNonStreamResult, attempts int
 		strings.TrimSpace(result.text) == ""
 }

-func (h *Handler) handleResponsesStreamWithRetry(w http.ResponseWriter, r *http.Request, a *auth.RequestAuth, resp *http.Response, payload map[string]any, pow, owner, responseID, model, finalPrompt string, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, toolChoice promptcompat.ToolChoicePolicy, traceID string) {
-	streamRuntime, initialType, ok := h.prepareResponsesStreamRuntime(w, resp, owner, responseID, model, finalPrompt, thinkingEnabled, searchEnabled, toolNames, toolsRaw, toolChoice, traceID)
+func (h *Handler) handleResponsesStreamWithRetry(w http.ResponseWriter, r *http.Request, a *auth.RequestAuth, resp *http.Response, payload map[string]any, pow, owner, responseID, model, finalPrompt string, refFileTokens int, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, toolChoice promptcompat.ToolChoicePolicy, traceID string) {
+	streamRuntime, initialType, ok := h.prepareResponsesStreamRuntime(w, resp, owner, responseID, model, finalPrompt, refFileTokens, thinkingEnabled, searchEnabled, toolNames, toolsRaw, toolChoice, traceID)
 	if !ok {
 		return
 	}
@@ -165,7 +174,7 @@ func (h *Handler) handleResponsesStreamWithRetry(w http.ResponseWriter, r *http.
 	}
 }

-func (h *Handler) prepareResponsesStreamRuntime(w http.ResponseWriter, resp *http.Response, owner, responseID, model, finalPrompt string, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, toolChoice promptcompat.ToolChoicePolicy, traceID string) (*responsesStreamRuntime, string, bool) {
+func (h *Handler) prepareResponsesStreamRuntime(w http.ResponseWriter, resp *http.Response, owner, responseID, model, finalPrompt string, refFileTokens int, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, toolChoice promptcompat.ToolChoicePolicy, traceID string) (*responsesStreamRuntime, string, bool) {
 	if resp.StatusCode != http.StatusOK {
 		defer func() { _ = resp.Body.Close() }()
 		body, _ := io.ReadAll(resp.Body)
@@ -190,6 +199,7 @@ func (h *Handler) prepareResponsesStreamRuntime(w http.ResponseWriter, resp *htt
 			h.getResponseStore().put(owner, responseID, obj)
 		},
 	)
+	streamRuntime.refFileTokens = refFileTokens
 	streamRuntime.sendCreated()
 	return streamRuntime, initialType, true
 }
--- a/internal/httpapi/openai/responses/handler.go
+++ b/internal/httpapi/openai/responses/handler.go
@@ -130,6 +130,6 @@ func filterIncrementalToolCallDeltasByAllowed(deltas []toolstream.ToolCallDelta,
 	return shared.FilterIncrementalToolCallDeltasByAllowed(deltas, seenNames)
 }

-func detectAssistantToolCalls(text, exposedThinking, detectionThinking string, toolNames []string) toolcall.ToolCallParseResult {
-	return shared.DetectAssistantToolCalls(text, exposedThinking, detectionThinking, toolNames)
+func detectAssistantToolCalls(rawText, visibleText, exposedThinking, detectionThinking string, toolNames []string) toolcall.ToolCallParseResult {
+	return shared.DetectAssistantToolCalls(rawText, visibleText, exposedThinking, detectionThinking, toolNames)
 }
--- a/internal/httpapi/openai/responses/ref_file_tokens.go
+++ b/internal/httpapi/openai/responses/ref_file_tokens.go
@@ -0,0 +1,26 @@
+package responses
+
+// addRefFileTokensToUsage adds inline-uploaded file token estimates to an existing
+// usage map inside a response object. This keeps the token accounting aware of file
+// content that the upstream model processes but that is not part of the prompt text.
+func addRefFileTokensToUsage(obj map[string]any, refFileTokens int) {
+	if refFileTokens <= 0 || obj == nil {
+		return
+	}
+	usage, ok := obj["usage"].(map[string]any)
+	if !ok || usage == nil {
+		return
+	}
+	for _, key := range []string{"input_tokens", "prompt_tokens"} {
+		if v, ok := usage[key]; ok {
+			if n, ok := v.(int); ok {
+				usage[key] = n + refFileTokens
+			}
+		}
+	}
+	if v, ok := usage["total_tokens"]; ok {
+		if n, ok := v.(int); ok {
+			usage["total_tokens"] = n + refFileTokens
+		}
+	}
+}
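Review note on `addRefFileTokensToUsage`: the helper only adjusts counters that are stored as plain `int`. The sketch below reproduces that behaviour in a standalone program, assuming the usage map is built in-process with `int` values. The response objects and the name `addRefFileTokens` are hypothetical stand-ins, not the real `openaifmt` output. It also shows that counters held as `float64` (which is what `encoding/json` produces when decoding into `map[string]any`) would be left unchanged.

```go
package main

import "fmt"

// addRefFileTokens follows the helper in the diff above: it adds the file
// token estimate to the input-side counters and to the total, and leaves
// every other field alone. Folding total_tokens into the loop is a
// simplification made for this sketch.
func addRefFileTokens(obj map[string]any, refFileTokens int) {
	if refFileTokens <= 0 || obj == nil {
		return
	}
	usage, ok := obj["usage"].(map[string]any)
	if !ok || usage == nil {
		return
	}
	for _, key := range []string{"input_tokens", "prompt_tokens", "total_tokens"} {
		// Only plain int counters are adjusted; other numeric types are skipped.
		if n, ok := usage[key].(int); ok {
			usage[key] = n + refFileTokens
		}
	}
}

func main() {
	// Hypothetical response object with int counters.
	native := map[string]any{"usage": map[string]any{
		"input_tokens": 10, "output_tokens": 5, "total_tokens": 15,
	}}
	addRefFileTokens(native, 100)
	fmt.Println(native["usage"])

	// float64 counters fail the int assertion, so nothing changes here.
	decoded := map[string]any{"usage": map[string]any{
		"input_tokens": float64(10), "total_tokens": float64(15),
	}}
	addRefFileTokens(decoded, 100)
	fmt.Println(decoded["usage"])
}
```

If a response object is ever rebuilt from stored JSON before this helper runs, the assertion would need to accept `float64` as well; whether that path exists is not visible in this diff.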
--- a/internal/httpapi/openai/responses/responses_handler.go
+++ b/internal/httpapi/openai/responses/responses_handler.go
@@ -114,14 +114,15 @@ func (h *Handler) Responses(w http.ResponseWriter, r *http.Request) {
 	}

 	responseID := "resp_" + strings.ReplaceAll(uuid.NewString(), "-", "")
+	refFileTokens := stdReq.RefFileTokens
 	if stdReq.Stream {
-		h.handleResponsesStreamWithRetry(w, r, a, resp, payload, pow, owner, responseID, stdReq.ResponseModel, stdReq.FinalPrompt, stdReq.Thinking, stdReq.Search, stdReq.ToolNames, stdReq.ToolsRaw, stdReq.ToolChoice, traceID)
+		h.handleResponsesStreamWithRetry(w, r, a, resp, payload, pow, owner, responseID, stdReq.ResponseModel, stdReq.PromptTokenText, refFileTokens, stdReq.Thinking, stdReq.Search, stdReq.ToolNames, stdReq.ToolsRaw, stdReq.ToolChoice, traceID)
 		return
 	}
-	h.handleResponsesNonStreamWithRetry(w, r.Context(), a, resp, payload, pow, owner, responseID, stdReq.ResponseModel, stdReq.FinalPrompt, stdReq.Thinking, stdReq.Search, stdReq.ToolNames, stdReq.ToolsRaw, stdReq.ToolChoice, traceID)
+	h.handleResponsesNonStreamWithRetry(w, r.Context(), a, resp, payload, pow, owner, responseID, stdReq.ResponseModel, stdReq.PromptTokenText, refFileTokens, stdReq.Thinking, stdReq.Search, stdReq.ToolNames, stdReq.ToolsRaw, stdReq.ToolChoice, traceID)
 }

-func (h *Handler) handleResponsesNonStream(w http.ResponseWriter, resp *http.Response, owner, responseID, model, finalPrompt string, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, toolChoice promptcompat.ToolChoicePolicy, traceID string) {
+func (h *Handler) handleResponsesNonStream(w http.ResponseWriter, resp *http.Response, owner, responseID, model, finalPrompt string, refFileTokens int, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, toolChoice promptcompat.ToolChoicePolicy, traceID string) {
 	defer func() { _ = resp.Body.Close() }()
 	if resp.StatusCode != http.StatusOK {
 		body, _ := io.ReadAll(resp.Body)
@@ -131,12 +132,11 @@ func (h *Handler) handleResponsesNonStream(w http.ResponseWriter, resp *http.Res
 	result := sse.CollectStream(resp, thinkingEnabled, true)
 	stripReferenceMarkers := h.compatStripReferenceMarkers()
 	sanitizedThinking := cleanVisibleOutput(result.Thinking, stripReferenceMarkers)
-	toolDetectionThinking := cleanVisibleOutput(result.ToolDetectionThinking, stripReferenceMarkers)
 	sanitizedText := cleanVisibleOutput(result.Text, stripReferenceMarkers)
 	if searchEnabled {
 		sanitizedText = replaceCitationMarkersWithLinks(sanitizedText, result.CitationLinks)
 	}
-	textParsed := detectAssistantToolCalls(sanitizedText, sanitizedThinking, toolDetectionThinking, toolNames)
+	textParsed := detectAssistantToolCalls(result.Text, sanitizedText, result.Thinking, result.ToolDetectionThinking, toolNames)
 	if len(textParsed.Calls) == 0 && writeUpstreamEmptyOutputError(w, sanitizedText, sanitizedThinking, result.ContentFilter) {
 		return
 	}
@@ -149,11 +149,14 @@ func (h *Handler) handleResponsesNonStream(w http.ResponseWriter, resp *http.Res
 	}

 	responseObj := openaifmt.BuildResponseObjectWithToolCalls(responseID, model, finalPrompt, sanitizedThinking, sanitizedText, textParsed.Calls, toolsRaw)
+	if refFileTokens > 0 {
+		addRefFileTokensToUsage(responseObj, refFileTokens)
+	}
 	h.getResponseStore().put(owner, responseID, responseObj)
 	writeJSON(w, http.StatusOK, responseObj)
 }

-func (h *Handler) handleResponsesStream(w http.ResponseWriter, r *http.Request, resp *http.Response, owner, responseID, model, finalPrompt string, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, toolChoice promptcompat.ToolChoicePolicy, traceID string) {
+func (h *Handler) handleResponsesStream(w http.ResponseWriter, r *http.Request, resp *http.Response, owner, responseID, model, finalPrompt string, refFileTokens int, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, toolChoice promptcompat.ToolChoicePolicy, traceID string) {
 	defer func() { _ = resp.Body.Close() }()
 	if resp.StatusCode != http.StatusOK {
 		body, _ := io.ReadAll(resp.Body)
@@ -195,6 +198,7 @@ func (h *Handler) handleResponsesStream(w http.ResponseWriter, r *http.Request,
 			h.getResponseStore().put(owner, responseID, obj)
 		},
 	)
+	streamRuntime.refFileTokens = refFileTokens
 	streamRuntime.sendCreated()

 	streamengine.ConsumeSSE(streamengine.ConsumeConfig{
--- a/internal/httpapi/openai/responses/responses_stream_delta_batch.go
+++ b/internal/httpapi/openai/responses/responses_stream_delta_batch.go
@@ -0,0 +1,39 @@
+package responses
+
+import (
+	"strings"
+
+	openaifmt "ds2api/internal/format/openai"
+)
+
+type responsesDeltaBatch struct {
+	runtime *responsesStreamRuntime
+	kind    string
+	text    strings.Builder
+}
+
+func (b *responsesDeltaBatch) append(kind, text string) {
+	if text == "" {
+		return
+	}
+	if b.kind != "" && b.kind != kind {
+		b.flush()
+	}
+	b.kind = kind
+	b.text.WriteString(text)
+}
+
+func (b *responsesDeltaBatch) flush() {
+	if b.kind == "" || b.text.Len() == 0 {
+		return
+	}
+	text := b.text.String()
+	switch b.kind {
+	case "reasoning":
+		b.runtime.sendEvent("response.reasoning.delta", openaifmt.BuildResponsesReasoningDeltaPayload(b.runtime.responseID, text))
+	case "text":
+		b.runtime.emitTextDelta(text)
+	}
+	b.kind = ""
+	b.text.Reset()
+}
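Review note on `responsesDeltaBatch`: consecutive fragments of the same kind are merged, and a change of kind forces a flush so ordering between reasoning and text is preserved. The following is a minimal sketch of that pattern, assuming a hypothetical `event` type and an in-memory `out` slice in place of the real `sendEvent` and `emitTextDelta` calls.

```go
package main

import (
	"fmt"
	"strings"
)

// event is a hypothetical stand-in for one emitted SSE frame.
type event struct {
	kind string
	text string
}

// deltaBatch coalesces consecutive fragments of the same kind,
// following the shape of responsesDeltaBatch in the diff above.
type deltaBatch struct {
	kind string
	text strings.Builder
	out  []event // collected here instead of being written to a stream
}

func (b *deltaBatch) append(kind, text string) {
	if text == "" {
		return
	}
	// A kind switch flushes first, so frames keep their original order.
	if b.kind != "" && b.kind != kind {
		b.flush()
	}
	b.kind = kind
	b.text.WriteString(text)
}

func (b *deltaBatch) flush() {
	if b.kind == "" || b.text.Len() == 0 {
		return
	}
	b.out = append(b.out, event{kind: b.kind, text: b.text.String()})
	b.kind = ""
	b.text.Reset()
}

func main() {
	var b deltaBatch
	parts := []event{{"reasoning", "a"}, {"reasoning", "b"}, {"text", "c"}, {"text", "d"}}
	for _, p := range parts {
		b.append(p.kind, p.text)
	}
	b.flush() // the final flush mirrors the call at the end of onParsed
	for _, e := range b.out {
		fmt.Printf("%s: %q\n", e.kind, e.text)
	}
}
```

Four input fragments become two frames here. In the diff the batch lives for one `onParsed` call, so coalescing only applies to parts that arrive in the same parsed line.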
--- a/internal/httpapi/openai/responses/responses_stream_runtime_core.go
+++ b/internal/httpapi/openai/responses/responses_stream_runtime_core.go
@@ -18,13 +18,14 @@ type responsesStreamRuntime struct {
 	rc       *http.ResponseController
 	canFlush bool

-	responseID  string
-	model       string
-	finalPrompt string
-	toolNames   []string
-	toolsRaw    any
-	traceID     string
-	toolChoice  promptcompat.ToolChoicePolicy
+	responseID    string
+	model         string
+	finalPrompt   string
+	refFileTokens int
+	toolNames     []string
+	toolsRaw      any
+	traceID       string
+	toolChoice    promptcompat.ToolChoicePolicy

 	thinkingEnabled       bool
 	searchEnabled         bool
@@ -36,8 +37,10 @@ type responsesStreamRuntime struct {
 	toolCallsDoneEmitted bool

 	sieve                 toolstream.State
+	rawThinking           strings.Builder
 	thinking              strings.Builder
 	toolDetectionThinking strings.Builder
+	rawText               strings.Builder
 	text                  strings.Builder
 	visibleText           strings.Builder
 	responseMessageID     int
@@ -141,15 +144,14 @@ func (s *responsesStreamRuntime) finalize(finishReason string, deferEmptyOutput
 	s.finalErrorStatus = 0
 	s.finalErrorMessage = ""
 	s.finalErrorCode = ""
-	finalThinking := s.thinking.String()
-	finalToolDetectionThinking := s.toolDetectionThinking.String()
-	finalText := cleanVisibleOutput(s.text.String(), s.stripReferenceMarkers)
-
 	if s.bufferToolContent {
 		s.processToolStreamEvents(toolstream.Flush(&s.sieve, s.toolNames), true, true)
 	}

-	textParsed := detectAssistantToolCalls(finalText, finalThinking, finalToolDetectionThinking, s.toolNames)
+	finalThinking := s.thinking.String()
+	finalToolDetectionThinking := s.toolDetectionThinking.String()
+	finalText := cleanVisibleOutput(s.text.String(), s.stripReferenceMarkers)
+	textParsed := detectAssistantToolCalls(s.rawText.String(), finalText, s.rawThinking.String(), finalToolDetectionThinking, s.toolNames)
 	detected := textParsed.Calls
 	s.logToolPolicyRejections(textParsed)

@@ -220,6 +222,7 @@ func (s *responsesStreamRuntime) onParsed(parsed sse.LineResult) streamengine.Pa
 	}

 	contentSeen := false
+	batch := responsesDeltaBatch{runtime: s}
 	for _, p := range parsed.ToolDetectionThinkingParts {
 		trimmed := sse.TrimContinuationOverlap(s.toolDetectionThinking.String(), p.Text)
 		if trimmed != "" {
@@ -227,38 +230,53 @@ func (s *responsesStreamRuntime) onParsed(parsed sse.LineResult) streamengine.Pa
 		}
 	}
 	for _, p := range parsed.Parts {
-		cleanedText := cleanVisibleOutput(p.Text, s.stripReferenceMarkers)
-		if cleanedText == "" {
-			continue
-		}
-		if p.Type != "thinking" && s.searchEnabled && sse.IsCitation(cleanedText) {
-			continue
-		}
-		contentSeen = true
 		if p.Type == "thinking" {
+			rawTrimmed := sse.TrimContinuationOverlap(s.rawThinking.String(), p.Text)
+			if rawTrimmed != "" {
+				s.rawThinking.WriteString(rawTrimmed)
+				contentSeen = true
+			}
 			if !s.thinkingEnabled {
 				continue
 			}
+			cleanedText := cleanVisibleOutput(rawTrimmed, s.stripReferenceMarkers)
+			if cleanedText == "" {
+				continue
+			}
 			trimmed := sse.TrimContinuationOverlap(s.thinking.String(), cleanedText)
 			if trimmed == "" {
 				continue
 			}
 			s.thinking.WriteString(trimmed)
-			s.sendEvent("response.reasoning.delta", openaifmt.BuildResponsesReasoningDeltaPayload(s.responseID, trimmed))
+			batch.append("reasoning", trimmed)
 			continue
 		}

+		rawTrimmed := sse.TrimContinuationOverlap(s.rawText.String(), p.Text)
+		if rawTrimmed == "" {
+			continue
+		}
+		s.rawText.WriteString(rawTrimmed)
+		contentSeen = true
+		cleanedText := cleanVisibleOutput(rawTrimmed, s.stripReferenceMarkers)
+		if s.searchEnabled && sse.IsCitation(cleanedText) {
+			continue
+		}
 		trimmed := sse.TrimContinuationOverlap(s.text.String(), cleanedText)
-		if trimmed == "" {
-			continue
+		if trimmed != "" {
+			s.text.WriteString(trimmed)
 		}
-		s.text.WriteString(trimmed)
 		if !s.bufferToolContent {
-			s.emitTextDelta(trimmed)
+			if trimmed == "" {
+				continue
+			}
+			batch.append("text", trimmed)
 			continue
 		}
-		s.processToolStreamEvents(toolstream.ProcessChunk(&s.sieve, trimmed, s.toolNames), true, true)
+		batch.flush()
+		s.processToolStreamEvents(toolstream.ProcessChunk(&s.sieve, rawTrimmed, s.toolNames), true, true)
 	}

+	batch.flush()
 	return streamengine.ParsedDecision{ContentSeen: contentSeen}
 }
--- a/internal/httpapi/openai/responses/responses_stream_runtime_events.go
+++ b/internal/httpapi/openai/responses/responses_stream_runtime_events.go
@@ -4,6 +4,7 @@ import (
 	"encoding/json"

 	openaifmt "ds2api/internal/format/openai"
+	"ds2api/internal/sse"
 	"ds2api/internal/toolstream"
 )

@@ -43,7 +44,10 @@ func (s *responsesStreamRuntime) sendDone() {
 func (s *responsesStreamRuntime) processToolStreamEvents(events []toolstream.Event, emitContent bool, resetAfterToolCalls bool) {
 	for _, evt := range events {
 		if emitContent && evt.Content != "" {
-			s.emitTextDelta(evt.Content)
+			cleaned := cleanVisibleOutput(evt.Content, s.stripReferenceMarkers)
+			if cleaned != "" && (!s.searchEnabled || !sse.IsCitation(cleaned)) {
+				s.emitTextDelta(cleaned)
+			}
 		}
 		if len(evt.ToolCallDeltas) > 0 {
 			if !s.emitEarlyToolDeltas {
--- a/internal/httpapi/openai/responses/responses_stream_runtime_toolcalls_finalize.go
+++ b/internal/httpapi/openai/responses/responses_stream_runtime_toolcalls_finalize.go
@@ -145,7 +145,7 @@ func (s *responsesStreamRuntime) buildCompletedResponseObject(finalThinking, fin
 		}
 	}

-	return openaifmt.BuildResponseObjectFromItems(
+	obj := openaifmt.BuildResponseObjectFromItems(
 		s.responseID,
 		s.model,
 		s.finalPrompt,
@@ -154,4 +154,8 @@ func (s *responsesStreamRuntime) buildCompletedResponseObject(finalThinking, fin
 		output,
 		outputText,
 	)
+	if s.refFileTokens > 0 {
+		addRefFileTokensToUsage(obj, s.refFileTokens)
+	}
+	return obj
 }
--- a/internal/httpapi/openai/responses/responses_stream_test.go
+++ b/internal/httpapi/openai/responses/responses_stream_test.go
@@ -27,7 +27,7 @@ func TestHandleResponsesStreamDoesNotEmitReasoningTextCompatEvents(t *testing.T)
 		Body:       io.NopCloser(strings.NewReader(streamBody)),
 	}

-	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_test", "deepseek-v4-pro", "prompt", true, false, nil, nil, promptcompat.DefaultToolChoicePolicy(), "")
+	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_test", "deepseek-v4-pro", "prompt", 0, true, false, nil, nil, promptcompat.DefaultToolChoicePolicy(), "")

 	body := rec.Body.String()
 	if !strings.Contains(body, "event: response.reasoning.delta") {
@@ -57,7 +57,7 @@ func TestHandleResponsesStreamEmitsOutputTextDoneBeforeContentPartDone(t *testin
 		Body:       io.NopCloser(strings.NewReader(streamBody)),
 	}

-	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", false, false, nil, nil, promptcompat.DefaultToolChoicePolicy(), "")
+	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", 0, false, false, nil, nil, promptcompat.DefaultToolChoicePolicy(), "")
 	body := rec.Body.String()
 	if !strings.Contains(body, "event: response.output_text.done") {
 		t.Fatalf("expected response.output_text.done payload, body=%s", body)
@@ -91,7 +91,7 @@ func TestHandleResponsesStreamOutputTextDeltaCarriesItemIndexes(t *testing.T) {
 		Body:       io.NopCloser(strings.NewReader(streamBody)),
 	}

-	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", false, false, nil, nil, promptcompat.DefaultToolChoicePolicy(), "")
+	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", 0, false, false, nil, nil, promptcompat.DefaultToolChoicePolicy(), "")
 	body := rec.Body.String()

 	deltaPayload, ok := extractSSEEventPayload(body, "response.output_text.delta")
@@ -109,6 +109,48 @@ func TestHandleResponsesStreamOutputTextDeltaCarriesItemIndexes(t *testing.T) {
 	}
 }

+func TestHandleResponsesStreamCoalescesSmallOutputTextDeltas(t *testing.T) {
+	h := &Handler{}
+	req := httptest.NewRequest(http.MethodPost, "/v1/responses", nil)
+	rec := httptest.NewRecorder()
+
+	var streamBody strings.Builder
+	for i := 0; i < 100; i++ {
+		b, _ := json.Marshal(map[string]any{
+			"p": "response/content",
+			"v": "字",
+		})
+		streamBody.WriteString("data: ")
+		streamBody.WriteString(string(b))
+		streamBody.WriteString("\n")
+	}
+	streamBody.WriteString("data: [DONE]\n")
+	resp := &http.Response{
+		StatusCode: http.StatusOK,
+		Body:       io.NopCloser(strings.NewReader(streamBody.String())),
+	}
+
+	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_coalesce", "deepseek-v4-flash", "prompt", 0, false, false, nil, nil, promptcompat.DefaultToolChoicePolicy(), "")
+
+	payloads := extractSSEEventPayloads(rec.Body.String(), "response.output_text.delta")
+	if len(payloads) == 0 {
+		t.Fatalf("expected response.output_text.delta payloads, body=%s", rec.Body.String())
+	}
+	var content strings.Builder
+	for _, payload := range payloads {
+		content.WriteString(asString(payload["delta"]))
+	}
+	if got, want := content.String(), strings.Repeat("字", 100); got != want {
+		t.Fatalf("coalesced response content mismatch: got %q want %q body=%s", got, want, rec.Body.String())
+	}
+	if len(payloads) >= 100 {
+		t.Fatalf("expected coalescing to reduce 100 tiny text deltas, got %d body=%s", len(payloads), rec.Body.String())
+	}
+	if !strings.Contains(rec.Body.String(), "event: response.completed") {
+		t.Fatalf("expected completed event, body=%s", rec.Body.String())
+	}
+}
+
 func TestHandleResponsesStreamEmitsDistinctToolCallIDsAcrossSeparateToolBlocks(t *testing.T) {
 	h := &Handler{}
 	req := httptest.NewRequest(http.MethodPost, "/v1/responses", nil)
@@ -130,7 +172,7 @@ func TestHandleResponsesStreamEmitsDistinctToolCallIDsAcrossSeparateToolBlocks(t
 		Body:       io.NopCloser(strings.NewReader(streamBody)),
 	}

-	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", false, false, []string{"read_file", "search"}, nil, promptcompat.DefaultToolChoicePolicy(), "")
+	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", 0, false, false, []string{"read_file", "search"}, nil, promptcompat.DefaultToolChoicePolicy(), "")

 	body := rec.Body.String()
 	doneEvents := extractSSEEventPayloads(body, "response.function_call_arguments.done")
@@ -183,7 +225,7 @@ func TestHandleResponsesStreamRequiredToolChoiceFailure(t *testing.T) {
 		Mode:    promptcompat.ToolChoiceRequired,
 		Allowed: map[string]struct{}{"read_file": {}},
 	}
-	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", false, false, []string{"read_file"}, nil, policy, "")
+	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", 0, false, false, []string{"read_file"}, nil, policy, "")

 	body := rec.Body.String()
 	if !strings.Contains(body, "event: response.failed") {
@@ -213,7 +255,7 @@ func TestHandleResponsesStreamFailsWhenUpstreamHasOnlyThinking(t *testing.T) {
 		Body:       io.NopCloser(strings.NewReader(streamBody)),
 	}

-	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_test", "deepseek-v4-pro", "prompt", true, false, nil, nil, promptcompat.DefaultToolChoicePolicy(), "")
+	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_test", "deepseek-v4-pro", "prompt", 0, true, false, nil, nil, promptcompat.DefaultToolChoicePolicy(), "")

 	body := rec.Body.String()
 	if !strings.Contains(body, "event: response.failed") {
@@ -251,11 +293,11 @@ func TestHandleResponsesStreamPromotesThinkingToolCallsOnFinalizeWithoutMidstrea
 		Body:       io.NopCloser(strings.NewReader(streamBody)),
 	}

-	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_test", "deepseek-v4-pro", "prompt", true, false, []string{"read_file"}, nil, promptcompat.DefaultToolChoicePolicy(), "")
+	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_test", "deepseek-v4-pro", "prompt", 0, true, false, []string{"read_file"}, nil, promptcompat.DefaultToolChoicePolicy(), "")

 	body := rec.Body.String()
-	if !strings.Contains(body, "event: response.reasoning.delta") {
-		t.Fatalf("expected reasoning delta in stream body, got %s", body)
+	if strings.Contains(body, "event: response.reasoning.delta") {
+		t.Fatalf("did not expect leaked reasoning delta in stream body, got %s", body)
 	}
 	if !strings.Contains(body, "event: response.function_call_arguments.done") {
 		t.Fatalf("expected finalize fallback function call event, got %s", body)
@@ -288,7 +330,7 @@ func TestHandleResponsesStreamPromotesHiddenThinkingDSMLToolCallsOnFinalize(t *t
 		Mode:    promptcompat.ToolChoiceRequired,
 		Allowed: map[string]struct{}{"read_file": {}},
 	}
-	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_hidden", "deepseek-v4-pro", "prompt", false, false, []string{"read_file"}, nil, policy, "")
+	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_hidden", "deepseek-v4-pro", "prompt", 0, false, false, []string{"read_file"}, nil, policy, "")

 	body := rec.Body.String()
 	if strings.Contains(body, "event: response.reasoning.delta") {
@@ -317,7 +359,7 @@ func TestHandleResponsesNonStreamRequiredToolChoiceViolation(t *testing.T) {
 		Allowed: map[string]struct{}{"read_file": {}},
 	}

-	h.handleResponsesNonStream(rec, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", false, false, []string{"read_file"}, nil, policy, "")
+	h.handleResponsesNonStream(rec, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", 0, false, false, []string{"read_file"}, nil, policy, "")
 	if rec.Code != http.StatusUnprocessableEntity {
 		t.Fatalf("expected 422 for required tool_choice violation, got %d body=%s", rec.Code, rec.Body.String())
 	}
@@ -344,7 +386,7 @@ func TestHandleResponsesNonStreamRequiredToolChoiceIgnoresThinkingToolPayloadWhe
 		Allowed: map[string]struct{}{"read_file": {}},
 	}

-	h.handleResponsesNonStream(rec, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", true, false, []string{"read_file"}, nil, policy, "")
+	h.handleResponsesNonStream(rec, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", 0, true, false, []string{"read_file"}, nil, policy, "")
 	if rec.Code != http.StatusUnprocessableEntity {
 		t.Fatalf("expected 422 for required tool_choice violation, got %d body=%s", rec.Code, rec.Body.String())
 	}
@@ -366,7 +408,7 @@ func TestHandleResponsesNonStreamReturns429WhenUpstreamOutputEmpty(t *testing.T)
 		)),
 	}

-	h.handleResponsesNonStream(rec, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", false, false, nil, nil, promptcompat.DefaultToolChoicePolicy(), "")
+	h.handleResponsesNonStream(rec, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", 0, false, false, nil, nil, promptcompat.DefaultToolChoicePolicy(), "")
 	if rec.Code != http.StatusTooManyRequests {
 		t.Fatalf("expected 429 for empty upstream output, got %d body=%s", rec.Code, rec.Body.String())
 	}
@@ -388,7 +430,7 @@ func TestHandleResponsesNonStreamReturnsContentFilterErrorWhenUpstreamFilteredWi
 		)),
 	}

-	h.handleResponsesNonStream(rec, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", false, false, nil, nil, promptcompat.DefaultToolChoicePolicy(), "")
+	h.handleResponsesNonStream(rec, resp, "owner-a", "resp_test", "deepseek-v4-flash", "prompt", 0, false, false, nil, nil, promptcompat.DefaultToolChoicePolicy(), "")
 	if rec.Code != http.StatusBadRequest {
 		t.Fatalf("expected 400 for filtered empty upstream output, got %d body=%s", rec.Code, rec.Body.String())
 	}
@@ -410,7 +452,7 @@ func TestHandleResponsesNonStreamReturns429WhenUpstreamHasOnlyThinking(t *testin
 		)),
 	}

-	h.handleResponsesNonStream(rec, resp, "owner-a", "resp_test", "deepseek-v4-pro", "prompt", true, false, nil, nil, promptcompat.DefaultToolChoicePolicy(), "")
+	h.handleResponsesNonStream(rec, resp, "owner-a", "resp_test", "deepseek-v4-pro", "prompt", 0, true, false, nil, nil, promptcompat.DefaultToolChoicePolicy(), "")
 	if rec.Code != http.StatusTooManyRequests {
 		t.Fatalf("expected 429 for thinking-only upstream output, got %d body=%s", rec.Code, rec.Body.String())
 	}
@@ -432,7 +474,7 @@ func TestHandleResponsesNonStreamPromotesThinkingToolCallsWhenTextEmpty(t *testi
 		)),
 	}

-	h.handleResponsesNonStream(rec, resp, "owner-a", "resp_test", "deepseek-v4-pro", "prompt", true, false, []string{"read_file"}, nil, promptcompat.DefaultToolChoicePolicy(), "")
+	h.handleResponsesNonStream(rec, resp, "owner-a", "resp_test", "deepseek-v4-pro", "prompt", 0, true, false, []string{"read_file"}, nil, promptcompat.DefaultToolChoicePolicy(), "")
 	if rec.Code != http.StatusOK {
 		t.Fatalf("expected 200 for thinking tool calls, got %d body=%s", rec.Code, rec.Body.String())
 	}
@@ -462,7 +504,7 @@ func TestHandleResponsesNonStreamPromotesHiddenThinkingDSMLToolCallsWhenTextEmpt
 		Mode:    promptcompat.ToolChoiceRequired,
 		Allowed: map[string]struct{}{"read_file": {}},
 	}
-	h.handleResponsesNonStream(rec, resp, "owner-a", "resp_hidden", "deepseek-v4-pro", "prompt", false, false, []string{"read_file"}, nil, policy, "")
+	h.handleResponsesNonStream(rec, resp, "owner-a", "resp_hidden", "deepseek-v4-pro", "prompt", 0, false, false, []string{"read_file"}, nil, policy, "")
 	if rec.Code != http.StatusOK {
 		t.Fatalf("expected 200 for hidden thinking tool calls, got %d body=%s", rec.Code, rec.Body.String())
 	}
@@ -509,7 +551,7 @@ func TestHandleResponsesStreamCoercesSchemaDeclaredStringArguments(t *testing.T)
 		Body:       io.NopCloser(strings.NewReader(streamBody)),
 	}

-	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_string_protect", "deepseek-v4-flash", "prompt", false, false, []string{"Write"}, toolsRaw, promptcompat.DefaultToolChoicePolicy(), "")
+	h.handleResponsesStream(rec, req, resp, "owner-a", "resp_string_protect", "deepseek-v4-flash", "prompt", 0, false, false, []string{"Write"}, toolsRaw, promptcompat.DefaultToolChoicePolicy(), "")

 	payload, ok := extractSSEEventPayload(rec.Body.String(), "response.function_call_arguments.done")
 	if !ok {
--- a/internal/httpapi/openai/shared/assistant_toolcalls.go
+++ b/internal/httpapi/openai/shared/assistant_toolcalls.go
@@ -6,12 +6,12 @@ import (
 	"ds2api/internal/toolcall"
 )

-func DetectAssistantToolCalls(text, exposedThinking, detectionThinking string, toolNames []string) toolcall.ToolCallParseResult {
-	textParsed := toolcall.ParseStandaloneToolCallsDetailed(text, toolNames)
+func DetectAssistantToolCalls(rawText, visibleText, exposedThinking, detectionThinking string, toolNames []string) toolcall.ToolCallParseResult {
+	textParsed := toolcall.ParseStandaloneToolCallsDetailed(rawText, toolNames)
 	if len(textParsed.Calls) > 0 {
 		return textParsed
 	}
-	if strings.TrimSpace(text) != "" {
+	if strings.TrimSpace(visibleText) != "" {
 		return textParsed
 	}
 	thinking := detectionThinking
--- a/internal/httpapi/openai/shared/citation_links.go
+++ b/internal/httpapi/openai/shared/citation_links.go
@@ -13,7 +13,7 @@ func ReplaceCitationMarkersWithLinks(text string, links map[int]string) string {
 	if strings.TrimSpace(text) == "" || len(links) == 0 {
 		return text
 	}
-	zeroBased := strings.Contains(strings.ToLower(text), "[reference:0]")
+	zeroBasedReference := hasZeroBasedReferenceMarker(text)
 	return citationMarkerPattern.ReplaceAllStringFunc(text, func(match string) string {
 		sub := citationMarkerPattern.FindStringSubmatch(match)
 		if len(sub) < 3 {
@@ -24,7 +24,7 @@ func ReplaceCitationMarkersWithLinks(text string, links map[int]string) string {
 			return match
 		}
 		lookupIdx := idx
-		if zeroBased {
+		if strings.EqualFold(sub[1], "reference") && zeroBasedReference {
 			lookupIdx = idx + 1
 		}
 		url := strings.TrimSpace(links[lookupIdx])
@@ -34,3 +34,16 @@ func ReplaceCitationMarkersWithLinks(text string, links map[int]string) string {
 		return fmt.Sprintf("[%d](%s)", idx, url)
 	})
 }
+
+func hasZeroBasedReferenceMarker(text string) bool {
+	for _, sub := range citationMarkerPattern.FindAllStringSubmatch(text, -1) {
+		if len(sub) < 3 || !strings.EqualFold(sub[1], "reference") {
+			continue
+		}
+		idx, err := strconv.Atoi(strings.TrimSpace(sub[2]))
+		if err == nil && idx == 0 {
+			return true
+		}
+	}
+	return false
+}
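Review note on the citation change: the zero-based shift is now applied only to `reference` markers, and only when a `reference` marker with index 0 is present. The sketch below illustrates the effect. `markerPattern` is an assumed stand-in for `citationMarkerPattern`, which this diff does not show, and the `citation` marker kind and the example URLs are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// markerPattern is an assumption: group 1 is the marker kind, group 2 the index.
var markerPattern = regexp.MustCompile(`(?i)\[(citation|reference):\s*(\d+)\]`)

// hasZeroBasedReference reports whether any reference marker uses index 0.
func hasZeroBasedReference(text string) bool {
	for _, sub := range markerPattern.FindAllStringSubmatch(text, -1) {
		if !strings.EqualFold(sub[1], "reference") {
			continue
		}
		if idx, err := strconv.Atoi(sub[2]); err == nil && idx == 0 {
			return true
		}
	}
	return false
}

// replaceMarkers rewrites markers as Markdown links, shifting the lookup
// index by one only for reference markers in zero-based text.
func replaceMarkers(text string, links map[int]string) string {
	zeroBased := hasZeroBasedReference(text)
	return markerPattern.ReplaceAllStringFunc(text, func(match string) string {
		sub := markerPattern.FindStringSubmatch(match)
		idx, err := strconv.Atoi(sub[2])
		if err != nil {
			return match
		}
		lookup := idx
		if zeroBased && strings.EqualFold(sub[1], "reference") {
			lookup = idx + 1
		}
		url := strings.TrimSpace(links[lookup])
		if url == "" {
			return match // leave markers without a known link untouched
		}
		return fmt.Sprintf("[%d](%s)", idx, url)
	})
}

func main() {
	links := map[int]string{1: "https://a.example", 2: "https://b.example"}
	fmt.Println(replaceMarkers("x [reference:0] y [citation:1]", links))
}
```

With the previous substring check, the presence of `[reference:0]` would also have shifted the `citation` marker to lookup index 2; under the new rule it stays at 1.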
--- a/internal/httpapi/openai/shared/leaked_output_sanitize.go
+++ b/internal/httpapi/openai/shared/leaked_output_sanitize.go
@@ -3,6 +3,8 @@ package shared
 import (
 	"regexp"
 	"strings"
+
+	"ds2api/internal/toolcall"
 )

 var emptyJSONFencePattern = regexp.MustCompile("(?is)```json\\s*```")
@@ -47,10 +49,42 @@ func sanitizeLeakedOutput(text string) string {
 	out = leakedThinkTagPattern.ReplaceAllString(out, "")
 	out = leakedBOSMarkerPattern.ReplaceAllString(out, "")
 	out = leakedMetaMarkerPattern.ReplaceAllString(out, "")
+	out = stripLeakedToolCallWrapperBlocks(out)
 	out = sanitizeLeakedAgentXMLBlocks(out)
 	return out
 }

+func stripLeakedToolCallWrapperBlocks(text string) string {
+	if text == "" {
+		return text
+	}
+	var b strings.Builder
+	pos := 0
+	for pos < len(text) {
+		tag, ok := toolcall.FindToolMarkupTagOutsideIgnored(text, pos)
+		if !ok {
+			b.WriteString(text[pos:])
+			break
+		}
+		if tag.Start > pos {
+			b.WriteString(text[pos:tag.Start])
+		}
+		if tag.Closing || tag.Name != "tool_calls" {
+			b.WriteString(text[tag.Start : tag.End+1])
+			pos = tag.End + 1
+			continue
+		}
+		closeTag, ok := toolcall.FindMatchingToolMarkupClose(text, tag)
+		if !ok {
+			b.WriteString(text[tag.Start : tag.End+1])
+			pos = tag.End + 1
+			continue
+		}
+		pos = closeTag.End + 1
+	}
+	return b.String()
+}
+
 func stripDanglingThinkSuffix(text string) string {
 	matches := leakedThinkTagPattern.FindAllStringIndex(text, -1)
 	if len(matches) == 0 {
--- a/internal/js/chat-stream/sse_parse_impl.js
+++ b/internal/js/chat-stream/sse_parse_impl.js
@@ -70,7 +70,6 @@ function finalizeThinkingParts(parts, thinkingEnabled, newType) {
  }
  if (!thinkingEnabled) {
    finalParts = dropThinkingParts(finalParts);
-    finalType = 'text';
  }
  return { parts: finalParts, newType: finalType };
 }
@@ -213,6 +212,12 @@ function parseChunkForContent(chunk, thinkingEnabled, currentType, stripReferenc
    }
  }

+  if (pathValue === 'response/content') {
+    newType = 'text';
+  } else if (pathValue === 'response/thinking_content' && (!thinkingEnabled || newType !== 'text')) {
+    newType = 'thinking';
+  }
+
  let partType = 'text';
  if (pathValue === 'response/thinking_content') {
    if (!thinkingEnabled) {
@@ -226,8 +231,8 @@ function parseChunkForContent(chunk, thinkingEnabled, currentType, stripReferenc
    partType = 'text';
  } else if (pathValue.includes('response/fragments') && pathValue.includes('/content')) {
    partType = newType;
-  } else if (!pathValue && thinkingEnabled) {
-    partType = newType;
+  } else if (!pathValue) {
+    partType = newType || 'text';
  }

  const val = chunk.v;
@@ -308,6 +313,10 @@ function parseChunkForContent(chunk, thinkingEnabled, currentType, stripReferenc
  }

  if (val && typeof val === 'object') {
+    const directContent = asContentString(val, stripReferenceMarkers);
+    if (directContent) {
+      parts.push({ text: directContent, type: partType });
+    }
    const resp = val.response && typeof val.response === 'object' ? val.response : val;
    if (Array.isArray(resp.fragments)) {
      for (const frag of resp.fragments) {
@@ -593,6 +602,12 @@ function asContentString(v, stripReferenceMarkers = true) {
    if (Object.prototype.hasOwnProperty.call(v, 'v')) {
      return asContentString(v.v, stripReferenceMarkers);
    }
+    if (Object.prototype.hasOwnProperty.call(v, 'text')) {
+      return asContentString(v.text, stripReferenceMarkers);
+    }
+    if (Object.prototype.hasOwnProperty.call(v, 'value')) {
+      return asContentString(v.value, stripReferenceMarkers);
+    }
    return '';
  }
  if (v == null) {
--- a/internal/js/chat-stream/stream_emitter.js
+++ b/internal/js/chat-stream/stream_emitter.js
@@ -1,5 +1,8 @@
 'use strict';

+const MIN_DELTA_FLUSH_CHARS = 160;
+const MAX_DELTA_FLUSH_WAIT_MS = 80;
+
 function createChatCompletionEmitter({ res, sessionID, created, model, isClosed }) {
  let firstChunkSent = false;

@@ -34,6 +37,62 @@ function createChatCompletionEmitter({ res, sessionID, created, model, isClosed
  };
 }

+function createDeltaCoalescer({ sendDeltaFrame, minFlushChars = MIN_DELTA_FLUSH_CHARS, maxFlushWaitMS = MAX_DELTA_FLUSH_WAIT_MS }) {
+  let pendingField = '';
+  let pendingText = '';
+  let flushTimer = null;
+
+  const clearFlushTimer = () => {
+    if (flushTimer) {
+      clearTimeout(flushTimer);
+      flushTimer = null;
+    }
+  };
+
+  const flush = () => {
+    clearFlushTimer();
+    if (!pendingField || !pendingText) {
+      return;
+    }
+    const delta = { [pendingField]: pendingText };
+    pendingField = '';
+    pendingText = '';
+    sendDeltaFrame(delta);
+  };
+
+  const scheduleFlush = () => {
+    if (flushTimer || maxFlushWaitMS <= 0) {
+      return;
+    }
+    flushTimer = setTimeout(flush, maxFlushWaitMS);
+    if (typeof flushTimer.unref === 'function') {
+      flushTimer.unref();
+    }
+  };
+
+  const append = (field, text) => {
+    if (!field || !text) {
+      return;
+    }
+    if (pendingField && pendingField !== field) {
+      flush();
+    }
+    pendingField = field;
+    pendingText += text;
+    if ([...pendingText].length >= minFlushChars) {
+      flush();
+      return;
+    }
+    scheduleFlush();
+  };
+
+  return {
+    append,
+    flush,
+  };
+}
+
 module.exports = {
  createChatCompletionEmitter,
+  createDeltaCoalescer,
 };
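The coalescer added in this file buffers consecutive deltas for one field and emits a frame when the buffered text reaches `minFlushChars` code points, when the field changes, when the timer fires, or when `flush()` is called explicitly. The following is a minimal sketch of that behavior, not the shipped module: it restates the function in condensed form so it runs standalone, disables the timer with `maxFlushWaitMS: 0`, and uses a hypothetical `frames` array in place of the real `sendDeltaFrame`.

```javascript
'use strict';

// Condensed restatement of createDeltaCoalescer from the diff above so the
// example is self-contained. Defaults mirror MIN_DELTA_FLUSH_CHARS and
// MAX_DELTA_FLUSH_WAIT_MS.
function createDeltaCoalescer({ sendDeltaFrame, minFlushChars = 160, maxFlushWaitMS = 80 }) {
  let pendingField = '';
  let pendingText = '';
  let flushTimer = null;

  const flush = () => {
    if (flushTimer) {
      clearTimeout(flushTimer);
      flushTimer = null;
    }
    if (!pendingField || !pendingText) {
      return;
    }
    const delta = { [pendingField]: pendingText };
    pendingField = '';
    pendingText = '';
    sendDeltaFrame(delta);
  };

  const append = (field, text) => {
    if (!field || !text) {
      return;
    }
    if (pendingField && pendingField !== field) {
      flush(); // a field switch emits the buffered text first
    }
    pendingField = field;
    pendingText += text;
    if ([...pendingText].length >= minFlushChars) {
      flush(); // size threshold reached, counted in code points
      return;
    }
    if (!flushTimer && maxFlushWaitMS > 0) {
      flushTimer = setTimeout(flush, maxFlushWaitMS);
      if (typeof flushTimer.unref === 'function') {
        flushTimer.unref();
      }
    }
  };

  return { append, flush };
}

// Hypothetical demo: `frames` stands in for the SSE writer.
function runCoalescerDemo() {
  const frames = [];
  const coalescer = createDeltaCoalescer({
    sendDeltaFrame: (delta) => frames.push(delta),
    minFlushChars: 8,
    maxFlushWaitMS: 0, // timer disabled so the demo stays synchronous
  });
  coalescer.append('reasoning_content', 'abc');
  coalescer.append('reasoning_content', 'def');
  coalescer.append('content', 'Hi');      // field switch flushes 'abcdef'
  coalescer.append('content', ' there!'); // 9 code points >= 8, flushes
  coalescer.append('content', 'tail');
  coalescer.flush();                      // explicit flush, as finish() does
  return frames;
}
```

In this run the five `append` calls collapse into three frames, which is the reduction in SSE frame count the change appears to be aiming for. The explicit `flush()` before any `tool_calls` frame in `vercel_stream_impl.js` keeps buffered text ordered ahead of the tool call.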
--- a/internal/js/chat-stream/vercel_stream_impl.js
+++ b/internal/js/chat-stream/vercel_stream_impl.js
@@ -20,7 +20,7 @@ const {
  boolDefaultTrue,
  resetStreamToolCallState,
 } = require('./toolcall_policy');
-const { createChatCompletionEmitter } = require('./stream_emitter');
+const { createChatCompletionEmitter, createDeltaCoalescer } = require('./stream_emitter');
 const {
  asString,
  isAbortError,
@@ -191,6 +191,7 @@ async function handleVercelStream(req, res, rawBody, payload) {
      model,
      isClosed: () => clientClosed,
    });
+    const deltaCoalescer = createDeltaCoalescer({ sendDeltaFrame });

    const finish = async (reason, options = {}) => {
      if (ended) {
@@ -201,25 +202,28 @@ async function handleVercelStream(req, res, rawBody, payload) {
        await releaseLease();
        return true;
      }
+      deltaCoalescer.flush();
      const detected = parseStandaloneToolCalls(outputText, toolNames);
      if (detected.length > 0 && !toolCallsDoneEmitted) {
        toolCallsEmitted = true;
        toolCallsDoneEmitted = true;
-        sendDeltaFrame({ tool_calls: formatOpenAIStreamToolCalls(detected, streamToolCallIDs) });
+        sendDeltaFrame({ tool_calls: formatOpenAIStreamToolCalls(detected, streamToolCallIDs, payload.tools) });
      } else if (toolSieveEnabled) {
        const tailEvents = flushToolSieve(toolSieveState, toolNames);
        for (const evt of tailEvents) {
          if (evt.type === 'tool_calls' && Array.isArray(evt.calls) && evt.calls.length > 0) {
+            deltaCoalescer.flush();
            toolCallsEmitted = true;
            toolCallsDoneEmitted = true;
-            sendDeltaFrame({ tool_calls: formatOpenAIStreamToolCalls(evt.calls, streamToolCallIDs) });
+            sendDeltaFrame({ tool_calls: formatOpenAIStreamToolCalls(evt.calls, streamToolCallIDs, payload.tools) });
            resetStreamToolCallState(streamToolCallIDs, streamToolNames);
            continue;
          }
          if (evt.text) {
-            sendDeltaFrame({ content: evt.text });
+            deltaCoalescer.append('content', evt.text);
          }
        }
+        deltaCoalescer.flush();
      }
      if (detected.length > 0 || toolCallsEmitted) {
        reason = 'tool_calls';
@@ -327,7 +331,7 @@ async function handleVercelStream(req, res, rawBody, payload) {
                      continue;
                    }
                    thinkingText += trimmed;
-                    sendDeltaFrame({ reasoning_content: trimmed });
+                    deltaCoalescer.append('reasoning_content', trimmed);
                  }
                } else {
                  const trimmed = trimContinuationOverlap(outputText, p.text);
@@ -339,7 +343,7 @@ async function handleVercelStream(req, res, rawBody, payload) {
                  }
                  outputText += trimmed;
                  if (!toolSieveEnabled) {
-                    sendDeltaFrame({ content: trimmed });
+                    deltaCoalescer.append('content', trimmed);
                    continue;
                  }
                  const events = processToolSieveChunk(toolSieveState, trimmed, toolNames);
@@ -352,6 +356,7 @@ async function handleVercelStream(req, res, rawBody, payload) {
                      const formatted = formatIncrementalToolCallDeltas(filtered, streamToolCallIDs);
                      if (formatted.length > 0) {
                        toolCallsEmitted = true;
+                        deltaCoalescer.flush();
                        sendDeltaFrame({ tool_calls: formatted });
                      }
                      continue;
@@ -359,12 +364,13 @@ async function handleVercelStream(req, res, rawBody, payload) {
                    if (evt.type === 'tool_calls') {
                      toolCallsEmitted = true;
                      toolCallsDoneEmitted = true;
-                      sendDeltaFrame({ tool_calls: formatOpenAIStreamToolCalls(evt.calls, streamToolCallIDs) });
+                      deltaCoalescer.flush();
+                      sendDeltaFrame({ tool_calls: formatOpenAIStreamToolCalls(evt.calls, streamToolCallIDs, payload.tools) });
                      resetStreamToolCallState(streamToolCallIDs, streamToolNames);
                      continue;
                    }
                    if (evt.text) {
-                      sendDeltaFrame({ content: evt.text });
+                      deltaCoalescer.append('content', evt.text);
                    }
                  }
                }
@@ -510,27 +516,87 @@ function observeContinueState(state, chunk) {
  if (topID > 0) {
    state.responseMessageID = topID;
  }
-  if (chunk.p === 'response/status') {
-    setContinueStatus(state, asString(chunk.v));
+  observeContinueDirectPatch(state, chunk.p, chunk.v);
+  if (chunk.p === 'response') {
+    observeContinueBatchPatches(state, 'response', chunk.v);
+  } else {
+    observeContinueBatchPatches(state, '', chunk.v);
  }
  const response = chunk.v && typeof chunk.v === 'object' ? chunk.v.response : null;
-  if (response && typeof response === 'object') {
-    const id = numberValue(response.message_id);
-    if (id > 0) {
-      state.responseMessageID = id;
-    }
-    setContinueStatus(state, asString(response.status));
-    if (response.auto_continue === true) {
-      state.lastStatus = 'AUTO_CONTINUE';
-    }
-  }
+  observeContinueResponseObject(state, response);
  const messageResponse = chunk.message && typeof chunk.message === 'object' && chunk.message.response;
-  if (messageResponse && typeof messageResponse === 'object') {
-    const id = numberValue(messageResponse.message_id);
-    if (id > 0) {
-      state.responseMessageID = id;
+  observeContinueResponseObject(state, messageResponse);
+}
+
+function observeContinueDirectPatch(state, path, value) {
+  if (!state) {
+    return;
+  }
+  switch (asString(path).trim().replace(/^\/+|\/+$/g, '')) {
+    case 'response/status':
+    case 'status':
+    case 'response/quasi_status':
+    case 'quasi_status':
+      setContinueStatus(state, asString(value));
+      break;
+    case 'response/auto_continue':
+    case 'auto_continue':
+      if (value === true) {
+        state.lastStatus = 'AUTO_CONTINUE';
+      }
+      break;
+    default:
+      break;
+  }
+}
+
+function observeContinueResponseObject(state, response) {
+  if (!state || !response || typeof response !== 'object') {
+    return;
+  }
+  const id = numberValue(response.message_id);
+  if (id > 0) {
+    state.responseMessageID = id;
+  }
+  setContinueStatus(state, asString(response.status));
+  if (response.auto_continue === true) {
+    state.lastStatus = 'AUTO_CONTINUE';
+  }
+}
+
+function observeContinueBatchPatches(state, parentPath, raw) {
+  if (!state || !Array.isArray(raw)) {
+    return;
+  }
+  for (const patch of raw) {
+    if (!patch || typeof patch !== 'object') {
+      continue;
+    }
+    const path = asString(patch.p).trim();
+    if (!path) {
+      continue;
+    }
+    let fullPath = path;
+    const parent = asString(parentPath).trim().replace(/^\/+|\/+$/g, '');
+    if (parent && !path.includes('/')) {
+      fullPath = `${parent}/${path}`;
+    }
+    switch (fullPath.replace(/^\/+|\/+$/g, '')) {
+      case 'response/status':
+      case 'status':
+      case 'response/quasi_status':
+      case 'quasi_status':
+        setContinueStatus(state, asString(patch.v));
+        break;
+      case 'response/auto_continue':
+      case 'auto_continue':
+        if (patch.v === true) {
+          state.lastStatus = 'AUTO_CONTINUE';
+        }
+        break;
+      default:
+        break;
    }
-    setContinueStatus(state, asString(messageResponse.status));
  }
 }

@@ -540,7 +606,7 @@ function setContinueStatus(state, status) {
    return;
  }
  state.lastStatus = normalized;
-  if (normalized.toUpperCase() === 'FINISHED') {
+  if (['FINISHED', 'CONTENT_FILTER'].includes(normalized.toUpperCase())) {
    state.finished = true;
  }
 }
@@ -549,7 +615,7 @@ function shouldAutoContinue(state) {
  if (!state || state.finished || !state.sessionID || state.responseMessageID <= 0) {
    return false;
  }
-  return ['WIP', 'INCOMPLETE', 'AUTO_CONTINUE'].includes(asString(state.lastStatus).trim().toUpperCase());
+  return ['INCOMPLETE', 'AUTO_CONTINUE'].includes(asString(state.lastStatus).trim().toUpperCase());
 }

 function numberValue(v) {
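The continue-state changes above route both direct patches and batch patches through the same set of normalized paths, with `quasi_status` now treated like `status`. The sketch below illustrates only the path handling, under stated assumptions: `resolvePatchPath` and `classifyPatchPath` are hypothetical helper names that do not exist in the diff, and they mirror the slash trimming and parent joining done in `observeContinueBatchPatches`.

```javascript
'use strict';

// Strip surrounding whitespace plus leading and trailing slashes.
function trimPath(value) {
  return String(value == null ? '' : value).trim().replace(/^\/+|\/+$/g, '');
}

// Hypothetical helper: a child path without '/' is joined to its parent,
// a child path that already contains '/' is used as-is.
function resolvePatchPath(parentPath, path) {
  const child = String(path == null ? '' : path).trim();
  if (!child) {
    return '';
  }
  const parent = trimPath(parentPath);
  const full = parent && !child.includes('/') ? `${parent}/${child}` : child;
  return trimPath(full);
}

const STATUS_PATHS = new Set([
  'response/status',
  'status',
  'response/quasi_status',
  'quasi_status',
]);
const AUTO_CONTINUE_PATHS = new Set(['response/auto_continue', 'auto_continue']);

// Hypothetical classifier returning 'status', 'auto_continue', or 'ignored'.
function classifyPatchPath(parentPath, path) {
  const full = resolvePatchPath(parentPath, path);
  if (STATUS_PATHS.has(full)) {
    return 'status';
  }
  if (AUTO_CONTINUE_PATHS.has(full)) {
    return 'auto_continue';
  }
  return 'ignored';
}
```

For example, `classifyPatchPath('response', 'quasi_status')` and `classifyPatchPath('', 'quasi_status')` both yield `'status'`, while `classifyPatchPath('response', 'fragments')` yields `'ignored'`.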
--- a/internal/js/helpers/stream-tool-sieve/format.js
+++ b/internal/js/helpers/stream-tool-sieve/format.js
@@ -2,11 +2,12 @@

 const crypto = require('crypto');

-function formatOpenAIStreamToolCalls(calls, idStore) {
+function formatOpenAIStreamToolCalls(calls, idStore, toolsRaw) {
  if (!Array.isArray(calls) || calls.length === 0) {
    return [];
  }
-  return calls.map((c, idx) => ({
+  const normalized = normalizeParsedToolCallsForSchemas(calls, toolsRaw);
+  return normalized.map((c, idx) => ({
    index: idx,
    id: ensureStreamToolCallID(idStore, idx),
    type: 'function',
@@ -17,6 +18,194 @@ function formatOpenAIStreamToolCalls(calls, idStore) {
  }));
 }

+function normalizeParsedToolCallsForSchemas(calls, toolsRaw) {
+  if (!Array.isArray(calls) || calls.length === 0) {
+    return calls;
+  }
+  const schemas = buildToolSchemaIndex(toolsRaw);
+  if (!schemas) {
+    return calls;
+  }
+  let changedAny = false;
+  const out = calls.map((call) => {
+    const name = String(call && call.name || '').trim().toLowerCase();
+    const schema = schemas[name];
+    if (!schema || !call || !call.input || typeof call.input !== 'object' || Array.isArray(call.input)) {
+      return call;
+    }
+    const [normalized, changed] = normalizeToolValueWithSchema(call.input, schema);
+    if (!changed || !normalized || typeof normalized !== 'object' || Array.isArray(normalized)) {
+      return call;
+    }
+    changedAny = true;
+    return { ...call, input: normalized };
+  });
+  return changedAny ? out : calls;
+}
+
+function buildToolSchemaIndex(toolsRaw) {
+  if (!Array.isArray(toolsRaw) || toolsRaw.length === 0) {
+    return null;
+  }
+  const out = {};
+  for (const item of toolsRaw) {
+    if (!item || typeof item !== 'object' || Array.isArray(item)) {
+      continue;
+    }
+    const [name, schema] = extractToolNameAndSchema(item);
+    if (!name || !schema || typeof schema !== 'object' || Array.isArray(schema)) {
+      continue;
+    }
+    out[name.toLowerCase()] = schema;
+  }
+  return Object.keys(out).length > 0 ? out : null;
+}
+
+function extractToolNameAndSchema(tool) {
+  const fn = tool && typeof tool.function === 'object' && !Array.isArray(tool.function) ? tool.function : null;
+  const name = firstNonEmptyString(tool.name, fn && fn.name);
+  const schema = firstNonNil(
+    tool.parameters,
+    tool.input_schema,
+    tool.inputSchema,
+    tool.schema,
+    fn && fn.parameters,
+    fn && fn.input_schema,
+    fn && fn.inputSchema,
+    fn && fn.schema,
+  );
+  return [name, schema];
+}
+
+function normalizeToolValueWithSchema(value, schema) {
+  if (value == null || !schema || typeof schema !== 'object' || Array.isArray(schema)) {
+    return [value, false];
+  }
+  if (shouldCoerceSchemaToString(schema)) {
+    return stringifySchemaValue(value);
+  }
+  if (looksLikeObjectSchema(schema)) {
+    if (!value || typeof value !== 'object' || Array.isArray(value)) {
+      return [value, false];
+    }
+    const properties = schema.properties && typeof schema.properties === 'object' && !Array.isArray(schema.properties) ? schema.properties : null;
+    const additional = schema.additionalProperties;
+    let changed = false;
+    const out = {};
+    for (const [key, current] of Object.entries(value)) {
+      let next = current;
+      let fieldChanged = false;
+      if (properties && Object.prototype.hasOwnProperty.call(properties, key)) {
+        [next, fieldChanged] = normalizeToolValueWithSchema(current, properties[key]);
+      } else if (additional != null) {
+        [next, fieldChanged] = normalizeToolValueWithSchema(current, additional);
+      }
+      out[key] = next;
+      changed = changed || fieldChanged;
+    }
+    return changed ? [out, true] : [value, false];
+  }
+  if (looksLikeArraySchema(schema)) {
+    if (!Array.isArray(value) || value.length === 0 || schema.items == null) {
+      return [value, false];
+    }
+    let changed = false;
+    const out = value.map((item, idx) => {
+      const itemSchema = Array.isArray(schema.items) ? schema.items[idx] : schema.items;
+      if (itemSchema == null) {
+        return item;
+      }
+      const [next, itemChanged] = normalizeToolValueWithSchema(item, itemSchema);
+      changed = changed || itemChanged;
+      return next;
+    });
+    return changed ? [out, true] : [value, false];
+  }
+  return [value, false];
+}
+
+function shouldCoerceSchemaToString(schema) {
+  if (!schema || typeof schema !== 'object' || Array.isArray(schema)) {
+    return false;
+  }
+  if (typeof schema.const === 'string') {
+    return true;
+  }
+  if (Array.isArray(schema.enum) && schema.enum.length > 0 && schema.enum.every((item) => typeof item === 'string')) {
+    return true;
+  }
+  if (typeof schema.type === 'string') {
+    return schema.type.trim().toLowerCase() === 'string';
+  }
+  if (Array.isArray(schema.type) && schema.type.length > 0) {
+    let hasString = false;
+    for (const item of schema.type) {
+      if (typeof item !== 'string') {
+        return false;
+      }
+      const typ = item.trim().toLowerCase();
+      if (typ === 'string') {
+        hasString = true;
+      } else if (typ !== 'null') {
+        return false;
+      }
+    }
+    return hasString;
+  }
+  return false;
+}
+
+function looksLikeObjectSchema(schema) {
+  return !!schema && typeof schema === 'object' && !Array.isArray(schema) && (
+    (typeof schema.type === 'string' && schema.type.trim().toLowerCase() === 'object') ||
+    (schema.properties && typeof schema.properties === 'object' && !Array.isArray(schema.properties)) ||
+    schema.additionalProperties != null
+  );
+}
+
+function looksLikeArraySchema(schema) {
+  return !!schema && typeof schema === 'object' && !Array.isArray(schema) && (
+    (typeof schema.type === 'string' && schema.type.trim().toLowerCase() === 'array') ||
+    schema.items != null
+  );
+}
+
+function stringifySchemaValue(value) {
+  if (value == null) {
+    return [value, false];
+  }
+  if (typeof value === 'string') {
+    return [value, false];
+  }
+  try {
+    return [JSON.stringify(value), true];
+  } catch {
+    return [value, false];
+  }
+}
+
+function firstNonNil(...values) {
+  for (const value of values) {
+    if (value != null) {
+      return value;
+    }
+  }
+  return null;
+}
+
+function firstNonEmptyString(...values) {
+  for (const value of values) {
+    if (typeof value !== 'string') {
+      continue;
+    }
+    const trimmed = value.trim();
+    if (trimmed) {
+      return trimmed;
+    }
+  }
+  return '';
+}
+
 function ensureStreamToolCallID(idStore, index) {
  if (!(idStore instanceof Map)) {
    return `call_${newCallID()}`;
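`formatOpenAIStreamToolCalls` now normalizes parsed tool inputs against the declared tool schemas, so a value the model emitted as an object or number is stringified when the schema asks for a string. The block below is a reduced sketch of that idea only. `coerceStringFields` is a hypothetical name, it handles just flat object schemas with a plain `type: 'string'` property, and it omits the `const`/`enum` checks, nullable type arrays, nesting, `additionalProperties`, and the `try`/`catch` around `JSON.stringify` that the real code has.

```javascript
'use strict';

// Returns [value, changed]. The original object is returned untouched when
// nothing needed coercion, matching the [value, false] convention in the diff.
function coerceStringFields(input, schema) {
  const properties = schema && schema.properties && typeof schema.properties === 'object'
    ? schema.properties
    : {};
  let changed = false;
  const out = {};
  for (const [key, value] of Object.entries(input)) {
    const prop = Object.prototype.hasOwnProperty.call(properties, key) ? properties[key] : null;
    const wantsString = !!prop && typeof prop === 'object' && prop.type === 'string';
    if (wantsString && value != null && typeof value !== 'string') {
      out[key] = JSON.stringify(value); // non-string value for a string field
      changed = true;
    } else {
      out[key] = value; // already a string, null, or not a string field
    }
  }
  return changed ? [out, true] : [input, false];
}

// Hypothetical tool schema and parsed call input, for illustration only.
function runCoercionDemo() {
  const schema = {
    type: 'object',
    properties: {
      path: { type: 'string' },
      content: { type: 'string' },
      limit: { type: 'integer' },
    },
  };
  const parsedInput = { path: 'cfg.json', content: { ok: true }, limit: 5 };
  return coerceStringFields(parsedInput, schema);
}
```

Returning the original reference when nothing changed lets a caller skip copying the call object, which is how `normalizeParsedToolCallsForSchemas` uses its `changedAny` flag.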
--- a/internal/js/helpers/stream-tool-sieve/parse_payload.js
+++ b/internal/js/helpers/stream-tool-sieve/parse_payload.js
@@ -248,6 +248,9 @@ function replaceDSMLToolMarkupOutsideIgnored(text) {
    if (tag) {
      if (tag.dsmlLike) {
        out += `<${tag.closing ? '/' : ''}${tag.name}${raw.slice(tag.nameEnd, tag.end + 1)}`;
+        if (raw[tag.end] !== '>') {
+          out += '>';
+        }
      } else {
        out += raw.slice(tag.start, tag.end + 1);
      }
@@ -424,31 +427,42 @@ function scanToolMarkupTagAt(text, start) {
  }
  const lower = raw.toLowerCase();
  let i = start + 1;
+  while (i < raw.length && raw[i] === '<') {
+    i += 1;
+  }
  const closing = raw[i] === '/';
  if (closing) {
    i += 1;
  }
-  let dsmlLike = false;
-  if (i < raw.length && isToolMarkupPipe(raw[i])) {
-    dsmlLike = true;
-    i += 1;
-  }
-  if (lower.startsWith('dsml', i)) {
-    dsmlLike = true;
-    i += 'dsml'.length;
-    while (i < raw.length && isToolMarkupSeparator(raw[i])) {
-      i += 1;
-    }
-  }
+  const prefix = consumeToolMarkupNamePrefix(raw, lower, i);
+  i = prefix.next;
+  const dsmlLike = prefix.dsmlLike;
  const { name, len } = matchToolMarkupName(lower, i);
  if (!name) {
    return null;
  }
-  const nameEnd = i + len;
+  const originalNameEnd = i + len;
+  let nameEnd = originalNameEnd;
+  while (nameEnd < raw.length && isToolMarkupPipe(raw[nameEnd])) {
+    nameEnd += 1;
+  }
+  const hasTrailingPipe = nameEnd > originalNameEnd;
  if (!hasXmlTagBoundary(raw, nameEnd)) {
    return null;
  }
-  const end = findXmlTagEnd(raw, nameEnd);
+  let end = findXmlTagEnd(raw, nameEnd);
+  if (end < 0) {
+    if (!hasTrailingPipe) {
+      return null;
+    }
+    end = nameEnd - 1;
+  }
+  if (hasTrailingPipe) {
+    const nextLT = raw.indexOf('<', nameEnd);
+    if (nextLT >= 0 && end >= nextLT) {
+      end = nameEnd - 1;
+    }
+  }
  if (end < 0) {
    return null;
  }
@@ -520,37 +534,94 @@ function findPartialToolMarkupStart(text) {
  if (lastLT < 0) {
    return -1;
  }
-  const tail = raw.slice(lastLT);
+  const start = includeDuplicateLeadingLessThan(raw, lastLT);
+  const tail = raw.slice(start);
  if (tail.includes('>')) {
    return -1;
  }
-  const lowerTail = tail.toLowerCase();
-  const candidates = [
-    '<tool_calls', '<invoke', '<parameter',
-    '<|tool_calls', '<|invoke', '<|parameter',
-    '<｜tool_calls', '<｜invoke', '<｜parameter',
-    '<|dsml|tool_calls', '<|dsml|invoke', '<|dsml|parameter',
-    '<｜dsml|tool_calls', '<｜dsml|invoke', '<｜dsml|parameter',
-    '<dsmltool_calls', '<dsmlinvoke', '<dsmlparameter',
-    '<dsml tool_calls', '<dsml invoke', '<dsml parameter',
-    '<dsml|tool_calls', '<dsml|invoke', '<dsml|parameter',
-    '<|dsmltool_calls', '<|dsmlinvoke', '<|dsmlparameter',
-    '<|dsml tool_calls', '<|dsml invoke', '<|dsml parameter',
-  ];
-  for (const candidate of candidates) {
-    if (candidate.startsWith(lowerTail)) {
-      return lastLT;
-    }
+  return isPartialToolMarkupTagPrefix(tail) ? start : -1;
+}
+
+function includeDuplicateLeadingLessThan(text, idx) {
+  let out = idx;
+  while (out > 0 && text[out - 1] === '<') {
+    out -= 1;
  }
-  return -1;
+  return out;
 }

 function isToolMarkupPipe(ch) {
  return ch === '|' || ch === '｜';
 }

-function isToolMarkupSeparator(ch) {
-  return ch === ' ' || ch === '\t' || ch === '\r' || ch === '\n' || isToolMarkupPipe(ch);
+function isPartialToolMarkupTagPrefix(text) {
+  const raw = toStringSafe(text);
+  if (!raw || raw[0] !== '<' || raw.includes('>')) {
+    return false;
+  }
+  const lower = raw.toLowerCase();
+  let i = 1;
+  while (i < raw.length && raw[i] === '<') {
+    i += 1;
+  }
+  if (i >= raw.length) {
+    return true;
+  }
+  if (raw[i] === '/') {
+    i += 1;
+  }
+  while (i <= raw.length) {
+    if (i === raw.length) {
+      return true;
+    }
+    if (hasToolMarkupNamePrefix(lower.slice(i))) {
+      return true;
+    }
+    if ('dsml'.startsWith(lower.slice(i))) {
+      return true;
+    }
+    const next = consumeToolMarkupNamePrefixOnce(raw, lower, i);
+    if (!next.ok) {
+      return false;
+    }
+    i = next.next;
+  }
+  return false;
+}
+
+function consumeToolMarkupNamePrefix(raw, lower, idx) {
+  let next = idx;
+  let dsmlLike = false;
+  while (true) {
+    const consumed = consumeToolMarkupNamePrefixOnce(raw, lower, next);
+    if (!consumed.ok) {
+      return { next, dsmlLike };
+    }
+    next = consumed.next;
+    dsmlLike = true;
+  }
+}
+
+function consumeToolMarkupNamePrefixOnce(raw, lower, idx) {
+  if (idx < raw.length && isToolMarkupPipe(raw[idx])) {
+    return { next: idx + 1, ok: true };
+  }
+  if (idx < raw.length && [' ', '\t', '\r', '\n'].includes(raw[idx])) {
+    return { next: idx + 1, ok: true };
+  }
+  if (lower.startsWith('dsml', idx)) {
+    return { next: idx + 'dsml'.length, ok: true };
+  }
+  return { next: idx, ok: false };
+}
+
+function hasToolMarkupNamePrefix(lowerTail) {
+  for (const name of TOOL_MARKUP_NAMES) {
+    if (lowerTail.startsWith(name) || name.startsWith(lowerTail)) {
+      return true;
+    }
+  }
+  return false;
 }

 function matchToolMarkupName(lower, start) {
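The fixed candidate list in `findPartialToolMarkupStart` is replaced above by a small prefix scanner that accepts any mix of pipes, whitespace, and `dsml` before a tool tag name, and that also tolerates duplicated leading `<`. The sketch below restates that scanner in condensed form so it can be run in isolation. Two parts are assumptions, since they fall outside the shown hunks: the contents of `TOOL_MARKUP_NAMES` (taken here from the tag names in the removed candidate list) and the use of `lastIndexOf('<')` to locate the tail.

```javascript
'use strict';

// Assumption: the real TOOL_MARKUP_NAMES is defined elsewhere in the file.
const TOOL_MARKUP_NAMES = ['tool_calls', 'invoke', 'parameter'];

const isPipe = (ch) => ch === '|' || ch === '｜';

// Consume one pipe, one whitespace character, or one literal 'dsml'.
function consumePrefixOnce(raw, lower, idx) {
  if (idx < raw.length && (isPipe(raw[idx]) || [' ', '\t', '\r', '\n'].includes(raw[idx]))) {
    return { next: idx + 1, ok: true };
  }
  if (lower.startsWith('dsml', idx)) {
    return { next: idx + 'dsml'.length, ok: true };
  }
  return { next: idx, ok: false };
}

function hasNamePrefix(lowerTail) {
  return TOOL_MARKUP_NAMES.some((name) => lowerTail.startsWith(name) || name.startsWith(lowerTail));
}

function isPartialToolMarkupTagPrefix(raw) {
  if (!raw || raw[0] !== '<' || raw.includes('>')) {
    return false;
  }
  const lower = raw.toLowerCase();
  let i = 1;
  while (i < raw.length && raw[i] === '<') {
    i += 1; // skip duplicated '<'
  }
  if (i >= raw.length) {
    return true;
  }
  if (raw[i] === '/') {
    i += 1;
  }
  while (i <= raw.length) {
    if (i === raw.length) {
      return true; // only prefix characters so far, still undecided
    }
    const rest = lower.slice(i);
    if (hasNamePrefix(rest) || 'dsml'.startsWith(rest)) {
      return true;
    }
    const step = consumePrefixOnce(raw, lower, i);
    if (!step.ok) {
      return false;
    }
    i = step.next;
  }
  return false;
}

// Returns the index where a possibly incomplete tool tag begins, or -1.
function findPartialToolMarkupStart(text) {
  const raw = String(text == null ? '' : text);
  const lastLT = raw.lastIndexOf('<');
  if (lastLT < 0) {
    return -1;
  }
  let start = lastLT;
  while (start > 0 && raw[start - 1] === '<') {
    start -= 1; // include duplicated leading '<'
  }
  const tail = raw.slice(start);
  if (tail.includes('>')) {
    return -1;
  }
  return isPartialToolMarkupTagPrefix(tail) ? start : -1;
}
```

A streaming caller would hold back everything from the returned index onward until more text arrives, and release it as ordinary content once the tail stops looking like a tool tag.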
--- a/internal/js/helpers/stream-tool-sieve/tool-keywords.js
+++ b/internal/js/helpers/stream-tool-sieve/tool-keywords.js
@@ -1,55 +0,0 @@
-'use strict';
-
-const XML_TOOL_SEGMENT_TAGS = [
-  '<|dsml|tool_calls>', '<|dsml|tool_calls\n', '<|dsml|tool_calls ',
-  '<｜dsml|tool_calls>', '<｜dsml|tool_calls\n', '<｜dsml|tool_calls ',
-  '<|dsml|invoke ', '<|dsml|invoke\n', '<|dsml|invoke\t', '<|dsml|invoke\r',
-  '<|dsmltool_calls>', '<|dsmltool_calls\n', '<|dsmltool_calls ',
-  '<|dsmlinvoke ', '<|dsmlinvoke\n', '<|dsmlinvoke\t', '<|dsmlinvoke\r',
-  '<|dsml tool_calls>', '<|dsml tool_calls\n', '<|dsml tool_calls ',
-  '<|dsml invoke ', '<|dsml invoke\n', '<|dsml invoke\t', '<|dsml invoke\r',
-  '<dsml|tool_calls>', '<dsml|tool_calls\n', '<dsml|tool_calls ',
-  '<dsml|invoke ', '<dsml|invoke\n', '<dsml|invoke\t', '<dsml|invoke\r',
-  '<dsmltool_calls>', '<dsmltool_calls\n', '<dsmltool_calls ',
-  '<dsmlinvoke ', '<dsmlinvoke\n', '<dsmlinvoke\t', '<dsmlinvoke\r',
-  '<dsml tool_calls>', '<dsml tool_calls\n', '<dsml tool_calls ',
-  '<dsml invoke ', '<dsml invoke\n', '<dsml invoke\t', '<dsml invoke\r',
-  '<｜tool_calls>', '<｜tool_calls\n', '<｜tool_calls ',
-  '<｜invoke ', '<｜invoke\n', '<｜invoke\t', '<｜invoke\r',
-  '<|tool_calls>', '<|tool_calls\n', '<|tool_calls ',
-  '<|invoke ', '<|invoke\n', '<|invoke\t', '<|invoke\r',
-  '<tool_calls>', '<tool_calls\n', '<tool_calls ',
-  '<invoke ', '<invoke\n', '<invoke\t', '<invoke\r',
-];
-
-const XML_TOOL_OPENING_TAGS = [
-  '<|dsml|tool_calls',
-  '<｜dsml|tool_calls',
-  '<|dsmltool_calls',
-  '<|dsml tool_calls',
-  '<dsml|tool_calls',
-  '<dsmltool_calls',
-  '<dsml tool_calls',
-  '<｜tool_calls',
-  '<|tool_calls',
-  '<tool_calls',
-];
-
-const XML_TOOL_CLOSING_TAGS = [
-  '</|dsml|tool_calls>',
-  '</｜dsml|tool_calls>',
-  '</|dsmltool_calls>',
-  '</|dsml tool_calls>',
-  '</dsml|tool_calls>',
-  '</dsmltool_calls>',
-  '</dsml tool_calls>',
-  '</｜tool_calls>',
-  '</|tool_calls>',
-  '</tool_calls>',
-];
-
-module.exports = {
-  XML_TOOL_SEGMENT_TAGS,
-  XML_TOOL_OPENING_TAGS,
-  XML_TOOL_CLOSING_TAGS,
-};
--- a/internal/promptcompat/history_transcript.go
+++ b/internal/promptcompat/history_transcript.go
@@ -1,13 +1,12 @@
 package promptcompat

 import (
-	"fmt"
 	"strings"

 	"ds2api/internal/prompt"
 )

-const historySplitInjectedFilename = "IGNORE"
+const CurrentInputContextFilename = "history.txt"

 func BuildOpenAIHistoryTranscript(messages []any) string {
 	return buildOpenAIInjectedFileTranscript(messages)
@@ -32,5 +31,5 @@ func buildOpenAIInjectedFileTranscript(messages []any) string {
 	if transcript == "" {
 		return ""
 	}
-	return fmt.Sprintf("[file content end]\n\n%s\n\n[file name]: %s\n[file content begin]\n", transcript, historySplitInjectedFilename)
+	return transcript
 }
--- a/internal/promptcompat/prompt_build_test.go
+++ b/internal/promptcompat/prompt_build_test.go
@@ -88,6 +88,58 @@ func TestBuildOpenAIFinalPrompt_VercelPreparePathKeepsFinalAnswerInstruction(t *
 	}
 }

+func TestBuildOpenAIFinalPromptReadLikeToolIncludesCacheGuard(t *testing.T) {
+	messages := []any{
+		map[string]any{"role": "user", "content": "请读取文件"},
+	}
+	tools := []any{
+		map[string]any{
+			"type": "function",
+			"function": map[string]any{
+				"name":        "read_file",
+				"description": "Read a file",
+				"parameters": map[string]any{
+					"type": "object",
+				},
+			},
+		},
+	}
+
+	finalPrompt, _ := buildOpenAIFinalPrompt(messages, tools, "", false)
+	if !strings.Contains(finalPrompt, "Read-tool cache guard") {
+		t.Fatalf("read-like tool prompt missing cache guard: %q", finalPrompt)
+	}
+	if !strings.Contains(finalPrompt, "provides no file body") {
+		t.Fatalf("read-like tool prompt missing no-body handling: %q", finalPrompt)
+	}
+	if !strings.Contains(finalPrompt, "Do not repeatedly call the same read request") {
+		t.Fatalf("read-like tool prompt missing loop guard: %q", finalPrompt)
+	}
+}
+
+func TestBuildOpenAIFinalPromptNonReadToolOmitsCacheGuard(t *testing.T) {
+	messages := []any{
+		map[string]any{"role": "user", "content": "搜索一下"},
+	}
+	tools := []any{
+		map[string]any{
+			"type": "function",
+			"function": map[string]any{
+				"name":        "search",
+				"description": "Search docs",
+				"parameters": map[string]any{
+					"type": "object",
+				},
+			},
+		},
+	}
+
+	finalPrompt, _ := buildOpenAIFinalPrompt(messages, tools, "", false)
+	if strings.Contains(finalPrompt, "Read-tool cache guard") {
+		t.Fatalf("non-read tool prompt should not include read cache guard: %q", finalPrompt)
+	}
+}
+
 func TestBuildOpenAIFinalPromptWithThinkingKeepsPromptUnchanged(t *testing.T) {
 	messages := []any{
 		map[string]any{"role": "user", "content": "继续回答上一个问题"},
--- a/internal/promptcompat/request_normalize.go
+++ b/internal/promptcompat/request_normalize.go
@@ -39,20 +39,22 @@ func NormalizeOpenAIChatRequest(store ConfigReader, req map[string]any, traceID
 	refFileIDs := CollectOpenAIRefFileIDs(req)

 	return StandardRequest{
-		Surface:        "openai_chat",
-		RequestedModel: strings.TrimSpace(model),
-		ResolvedModel:  resolvedModel,
-		ResponseModel:  responseModel,
-		Messages:       messagesRaw,
-		ToolsRaw:       req["tools"],
-		FinalPrompt:    finalPrompt,
-		ToolNames:      toolNames,
-		ToolChoice:     toolPolicy,
-		Stream:         util.ToBool(req["stream"]),
-		Thinking:       thinkingEnabled,
-		Search:         searchEnabled,
-		RefFileIDs:     refFileIDs,
-		PassThrough:    passThrough,
+		Surface:         "openai_chat",
+		RequestedModel:  strings.TrimSpace(model),
+		ResolvedModel:   resolvedModel,
+		ResponseModel:   responseModel,
+		Messages:        messagesRaw,
+		PromptTokenText: finalPrompt,
+		ToolsRaw:        req["tools"],
+		FinalPrompt:     finalPrompt,
+		ToolNames:       toolNames,
+		ToolChoice:      toolPolicy,
+		Stream:          util.ToBool(req["stream"]),
+		Thinking:        thinkingEnabled,
+		Search:          searchEnabled,
+		RefFileIDs:      refFileIDs,
+		RefFileTokens:   estimateInlineFileTokens(req),
+		PassThrough:     passThrough,
 	}, nil
 }

@@ -99,20 +101,22 @@ func NormalizeOpenAIResponsesRequest(store ConfigReader, req map[string]any, tra
 	refFileIDs := CollectOpenAIRefFileIDs(req)

 	return StandardRequest{
-		Surface:        "openai_responses",
-		RequestedModel: model,
-		ResolvedModel:  resolvedModel,
-		ResponseModel:  model,
-		Messages:       messagesRaw,
-		ToolsRaw:       req["tools"],
-		FinalPrompt:    finalPrompt,
-		ToolNames:      toolNames,
-		ToolChoice:     toolPolicy,
-		Stream:         util.ToBool(req["stream"]),
-		Thinking:       thinkingEnabled,
-		Search:         searchEnabled,
-		RefFileIDs:     refFileIDs,
-		PassThrough:    passThrough,
+		Surface:         "openai_responses",
+		RequestedModel:  model,
+		ResolvedModel:   resolvedModel,
+		ResponseModel:   model,
+		Messages:        messagesRaw,
+		PromptTokenText: finalPrompt,
+		ToolsRaw:        req["tools"],
+		FinalPrompt:     finalPrompt,
+		ToolNames:       toolNames,
+		ToolChoice:      toolPolicy,
+		Stream:          util.ToBool(req["stream"]),
+		Thinking:        thinkingEnabled,
+		Search:          searchEnabled,
+		RefFileIDs:      refFileIDs,
+		RefFileTokens:   estimateInlineFileTokens(req),
+		PassThrough:     passThrough,
 	}, nil
 }

@@ -356,3 +360,30 @@ func namesToSet(names []string) map[string]struct{} {
 	}
 	return out
 }
+
+// estimateInlineFileTokens extracts the byte count stashed by PreprocessInlineFileInputs
+// and converts it to a conservative token estimate. Inline files are typically images or
+// documents that the upstream model will process; we use bytes/3 (rather than bytes/4)
+// as a slightly pessimistic approximation so the returned context token count stays
+// safely above the real value.
+func estimateInlineFileTokens(req map[string]any) int {
+	raw, ok := req["_inline_file_bytes"]
+	if !ok {
+		return 0
+	}
+	var bytes int
+	switch v := raw.(type) {
+	case int:
+		bytes = v
+	case int64:
+		bytes = int(v)
+	case float64:
+		bytes = int(v)
+	default:
+		return 0
+	}
+	if bytes <= 0 {
+		return 0
+	}
+	return bytes / 3
+}
--- a/internal/promptcompat/standard_request.go
+++ b/internal/promptcompat/standard_request.go
@@ -9,6 +9,7 @@ type StandardRequest struct {
 	ResponseModel           string
 	Messages                []any
 	HistoryText             string
+	PromptTokenText         string
 	CurrentInputFileApplied bool
 	ToolsRaw                any
 	FinalPrompt             string
@@ -18,6 +19,7 @@ type StandardRequest struct {
 	Thinking                bool
 	Search                  bool
 	RefFileIDs              []string
+	RefFileTokens           int
 	PassThrough             map[string]any
 }

--- a/internal/promptcompat/standard_request_test.go
+++ b/internal/promptcompat/standard_request_test.go
@@ -13,7 +13,7 @@ func TestStandardRequestCompletionPayloadSetsModelTypeFromResolvedModel(t *testi
 		{name: "default", model: "deepseek-v4-flash", thinking: false, search: false, modelType: "default"},
 		{name: "default_nothinking", model: "deepseek-v4-flash-nothinking", thinking: false, search: false, modelType: "default"},
 		{name: "expert", model: "deepseek-v4-pro", thinking: true, search: false, modelType: "expert"},
-		{name: "vision", model: "deepseek-v4-vision-search", thinking: false, search: true, modelType: "vision"},
+		{name: "vision", model: "deepseek-v4-vision", thinking: true, search: false, modelType: "vision"},
 	}

 	for _, tc := range tests {
--- a/internal/promptcompat/tool_prompt.go
+++ b/internal/promptcompat/tool_prompt.go
@@ -4,6 +4,7 @@ import (
 	"encoding/json"
 	"fmt"
 	"strings"
+	"unicode"

 	"ds2api/internal/toolcall"
 )
@@ -30,13 +31,7 @@ func injectToolPrompt(messages []map[string]any, tools []any, policy ToolChoiceP
 		if !ok {
 			continue
 		}
-		fn, _ := tool["function"].(map[string]any)
-		if len(fn) == 0 {
-			fn = tool
-		}
-		name, _ := fn["name"].(string)
-		desc, _ := fn["description"].(string)
-		schema, _ := fn["parameters"].(map[string]any)
+		name, desc, schema := toolcall.ExtractToolMeta(tool)
 		name = strings.TrimSpace(name)
 		if !isAllowed(name) {
 			continue
@@ -52,6 +47,9 @@ func injectToolPrompt(messages []map[string]any, tools []any, policy ToolChoiceP
 		return messages, names
 	}
 	toolPrompt := "You have access to these tools:\n\n" + strings.Join(toolSchemas, "\n\n") + "\n\n" + toolcall.BuildToolCallInstructions(names)
+	if hasReadLikeTool(names) {
+		toolPrompt += "\n\nRead-tool cache guard: If a Read/read_file-style tool result says the file is unchanged, already available in history, should be referenced from previous context, or otherwise provides no file body, treat that result as missing content. Do not repeatedly call the same read request for that missing body. Request a full-content read if the tool supports it, or tell the user that the file contents need to be provided again."
+	}
 	if policy.Mode == ToolChoiceRequired {
 		toolPrompt += "\n7) For this response, you MUST call at least one tool from the allowed list."
 	}
@@ -70,3 +68,23 @@ func injectToolPrompt(messages []map[string]any, tools []any, policy ToolChoiceP
 	messages = append([]map[string]any{{"role": "system", "content": toolPrompt}}, messages...)
 	return messages, names
 }
+
+func hasReadLikeTool(names []string) bool {
+	for _, name := range names {
+		switch normalizeToolNameForGuard(name) {
+		case "read", "readfile":
+			return true
+		}
+	}
+	return false
+}
+
+func normalizeToolNameForGuard(name string) string {
+	var b strings.Builder
+	for _, r := range strings.ToLower(strings.TrimSpace(name)) {
+		if unicode.IsLetter(r) || unicode.IsDigit(r) {
+			b.WriteRune(r)
+		}
+	}
+	return b.String()
+}
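The read-tool guard above depends on how tool names are normalized before matching. A minimal standalone sketch of that normalization follows; `normalizeName` and `isReadLike` are illustrative names that mirror the diff's logic, not the package's exported API.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// normalizeName lowercases and trims the name, then keeps only letters and
// digits, so separators such as "_" and "-" do not affect matching.
func normalizeName(name string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(strings.TrimSpace(name)) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteRune(r)
		}
	}
	return b.String()
}

// isReadLike reports whether the normalized name equals one of the two
// forms the guard checks for: "read" or "readfile".
func isReadLike(name string) bool {
	switch normalizeName(name) {
	case "read", "readfile":
		return true
	}
	return false
}

func main() {
	for _, n := range []string{"Read", "read_file", " Read-File ", "readFiles", "write_file"} {
		fmt.Printf("%q -> %q read-like=%v\n", n, normalizeName(n), isReadLike(n))
	}
}
```

Because the match is exact after normalization, a name such as `readFiles` does not trigger the guard; only spelling variants of `read` and `read_file` do.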
--- a/internal/server/router.go
+++ b/internal/server/router.go
@@ -98,6 +98,14 @@ func NewApp() (*App, error) {
 	r.Get("/v1/responses/{response_id}", responsesHandler.GetResponseByID)
 	r.Post("/v1/files", filesHandler.UploadFile)
 	r.Post("/v1/embeddings", embeddingsHandler.Embeddings)
+	// Root OpenAI aliases support clients configured with the bare DS2API service URL.
+	r.Get("/models", modelsHandler.ListModels)
+	r.Get("/models/{model_id}", modelsHandler.GetModel)
+	r.Post("/chat/completions", chatHandler.ChatCompletions)
+	r.Post("/responses", responsesHandler.Responses)
+	r.Get("/responses/{response_id}", responsesHandler.GetResponseByID)
+	r.Post("/files", filesHandler.UploadFile)
+	r.Post("/embeddings", embeddingsHandler.Embeddings)
 	claude.RegisterRoutes(r, claudeHandler)
 	gemini.RegisterRoutes(r, geminiHandler)
 	r.Route("/admin", func(ar chi.Router) {
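The root aliases above register the same handlers under a second path. The sketch below illustrates that idea using only the standard library's `http.ServeMux` in place of the chi router used in the diff; `newMux`, `fetch`, and the handler body are hypothetical stand-ins for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newMux registers one handler under both the canonical /v1 path and the
// root shortcut, so either base URL reaches the same code.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	listModels := func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `{"object":"list"}`) // placeholder body, not the real response
	}
	mux.HandleFunc("/v1/models", listModels)
	mux.HandleFunc("/models", listModels)
	return mux
}

// fetch issues an in-memory GET request and returns the status and body.
func fetch(h http.Handler, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	mux := newMux()
	for _, p := range []string{"/v1/models", "/models"} {
		code, body := fetch(mux, p)
		fmt.Println(p, code, body)
	}
}
```

Sharing the handler value rather than redirecting keeps the alias transparent to clients that do not follow redirects on POST requests.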
--- a/internal/server/router_routes_test.go
+++ b/internal/server/router_routes_test.go
@@ -37,6 +37,13 @@ func TestAPIRoutesRemainRegistered(t *testing.T) {
 		"GET /v1/responses/{response_id}",
 		"POST /v1/files",
 		"POST /v1/embeddings",
+		"GET /models",
+		"GET /models/{model_id}",
+		"POST /chat/completions",
+		"POST /responses",
+		"GET /responses/{response_id}",
+		"POST /files",
+		"POST /embeddings",
 		"GET /anthropic/v1/models",
 		"POST /anthropic/v1/messages",
 		"POST /anthropic/v1/messages/count_tokens",
--- a/internal/sse/consumer_edge_test.go
+++ b/internal/sse/consumer_edge_test.go
@@ -41,6 +41,15 @@ func TestCollectStreamTextOnly(t *testing.T) {
 	}
 }

+func TestCollectStreamHandlesLongSingleSSELine(t *testing.T) {
+	payload := strings.Repeat("x", 2*1024*1024+4096)
+	resp := makeHTTPResponse(makeLargeContentSSEBody(t, payload))
+	result := CollectStream(resp, false, true)
+	if result.Text != payload {
+		t.Fatalf("long SSE line payload mismatch: got len=%d want len=%d", len(result.Text), len(payload))
+	}
+}
+
 func TestCollectStreamThinkingAndText(t *testing.T) {
 	resp := makeHTTPResponse(
 		"data: {\"p\":\"response/thinking_content\",\"v\":\"Thinking...\"}\n" +
--- a/internal/sse/parser.go
+++ b/internal/sse/parser.go
@@ -92,6 +92,7 @@ func ParseSSEChunkForContentDetailed(chunk map[string]any, thinkingEnabled bool,
 	}
 	newType := currentFragmentType
 	parts := make([]ContentPart, 0, 8)
+	updateTypeFromExplicitPath(path, thinkingEnabled, &newType)
 	collectDirectFragments(path, chunk, v, &newType, &parts)
 	updateTypeFromNestedResponse(path, v, &newType)
 	partType := resolvePartType(path, thinkingEnabled, newType)
@@ -107,11 +108,24 @@ func ParseSSEChunkForContentDetailed(chunk map[string]any, thinkingEnabled bool,
 	detectionThinkingParts := selectThinkingParts(parts)
 	if !thinkingEnabled {
 		parts = dropThinkingParts(parts)
-		newType = "text"
 	}
 	return parts, detectionThinkingParts, false, newType
 }

+func updateTypeFromExplicitPath(path string, thinkingEnabled bool, newType *string) {
+	if newType == nil {
+		return
+	}
+	switch path {
+	case "response/content":
+		*newType = "text"
+	case "response/thinking_content":
+		if !thinkingEnabled || *newType != "text" {
+			*newType = "thinking"
+		}
+	}
+}
+
 func selectThinkingParts(parts []ContentPart) []ContentPart {
 	if len(parts) == 0 {
 		return nil
@@ -206,8 +220,11 @@ func resolvePartType(path string, thinkingEnabled bool, newType string) string {
 		return "text"
 	case strings.Contains(path, "response/fragments") && strings.Contains(path, "/content"):
 		return newType
-	case path == "" && thinkingEnabled:
-		return newType
+	case path == "":
+		if newType != "" {
+			return newType
+		}
+		return "text"
 	default:
 		return "text"
 	}
@@ -244,11 +261,29 @@ func appendChunkValueContent(v any, partType string, newType *string, parts *[]C
 		}
 		*parts = append(*parts, pp...)
 	case map[string]any:
+		if appendObjectContentByPath(path, val, partType, parts) {
+			return false
+		}
 		appendWrappedFragments(val, partType, newType, parts)
 	}
 	return false
 }

+func appendObjectContentByPath(path string, val map[string]any, partType string, parts *[]ContentPart) bool {
+	if path != "response/content" && path != "response/thinking_content" && path != "" {
+		return false
+	}
+	text, _ := val["text"].(string)
+	if text == "" {
+		text, _ = val["content"].(string)
+	}
+	if text == "" {
+		return false
+	}
+	appendContentPart(parts, text, partType)
+	return true
+}
+
 func appendWrappedFragments(val map[string]any, partType string, newType *string, parts *[]ContentPart) {
 	resp := val
 	if wrapped, ok := val["response"].(map[string]any); ok {
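The object-shaped `v` handling added above can be illustrated in isolation. This is a simplified sketch under stated assumptions: `chunkText` is a hypothetical helper, it ignores the path check and the fragment-type tracking the real parser performs, and it only shows the `text`-then-`content` fallback order.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// chunkText decodes one SSE JSON payload and returns its text. It accepts
// either a plain string "v" or an object "v" carrying "text" or "content".
func chunkText(payload string) (string, bool) {
	var chunk map[string]any
	if err := json.Unmarshal([]byte(payload), &chunk); err != nil {
		return "", false
	}
	switch v := chunk["v"].(type) {
	case string:
		return v, v != ""
	case map[string]any:
		text, _ := v["text"].(string)
		if text == "" {
			// Fall back to "content" when "text" is absent or empty.
			text, _ = v["content"].(string)
		}
		return text, text != ""
	}
	return "", false
}

func main() {
	payloads := []string{
		`{"p":"response/content","v":"plain"}`,
		`{"p":"response/content","v":{"text":"from text"}}`,
		`{"v":{"content":"from content"}}`,
		`{"v":{"other":1}}`,
	}
	for _, p := range payloads {
		text, ok := chunkText(p)
		fmt.Printf("%q ok=%v\n", text, ok)
	}
}
```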
--- a/internal/sse/parser_test.go
+++ b/internal/sse/parser_test.go
@@ -88,6 +88,71 @@ func TestParseSSEChunkForContentAfterAppendUsesUpdatedType(t *testing.T) {
 	}
 }

+func TestParseSSEChunkForContentThinkingDisabledKeepsHiddenFragmentState(t *testing.T) {
+	chunk1 := map[string]any{
+		"p": "response/fragments",
+		"o": "APPEND",
+		"v": []any{
+			map[string]any{"type": "THINK", "content": "我们"},
+		},
+	}
+	parts1, finished1, nextType1 := ParseSSEChunkForContent(chunk1, false, "text")
+	if finished1 {
+		t.Fatal("expected first chunk unfinished")
+	}
+	if nextType1 != "thinking" {
+		t.Fatalf("expected hidden THINK fragment to keep next type thinking, got %q", nextType1)
+	}
+	if len(parts1) != 0 {
+		t.Fatalf("expected hidden thinking to be dropped, got %#v", parts1)
+	}
+
+	chunk2 := map[string]any{
+		"p": "response/fragments/-1/content",
+		"v": "被",
+	}
+	parts2, finished2, nextType2 := ParseSSEChunkForContent(chunk2, false, nextType1)
+	if finished2 {
+		t.Fatal("expected second chunk unfinished")
+	}
+	if nextType2 != "thinking" {
+		t.Fatalf("expected hidden continuation to keep next type thinking, got %q", nextType2)
+	}
+	if len(parts2) != 0 {
+		t.Fatalf("expected hidden continuation to be dropped, got %#v", parts2)
+	}
+
+	chunk3 := map[string]any{"v": "要求"}
+	parts3, finished3, nextType3 := ParseSSEChunkForContent(chunk3, false, nextType2)
+	if finished3 {
+		t.Fatal("expected third chunk unfinished")
+	}
+	if nextType3 != "thinking" {
+		t.Fatalf("expected pathless hidden continuation to keep next type thinking, got %q", nextType3)
+	}
+	if len(parts3) != 0 {
+		t.Fatalf("expected pathless hidden continuation to be dropped, got %#v", parts3)
+	}
+
+	chunk4 := map[string]any{
+		"p": "response/fragments",
+		"o": "APPEND",
+		"v": []any{
+			map[string]any{"type": "RESPONSE", "content": "答"},
+		},
+	}
+	parts4, finished4, nextType4 := ParseSSEChunkForContent(chunk4, false, nextType3)
+	if finished4 {
+		t.Fatal("expected fourth chunk unfinished")
+	}
+	if nextType4 != "text" {
+		t.Fatalf("expected RESPONSE fragment to switch next type text, got %q", nextType4)
+	}
+	if len(parts4) != 1 || parts4[0].Type != "text" || parts4[0].Text != "答" {
+		t.Fatalf("expected visible response text, got %#v", parts4)
+	}
+}
+
 func TestParseSSEChunkForContentAutoTransitionsThinkClose(t *testing.T) {
 	chunk := map[string]any{
 		"p": "response/thinking_content",
@@ -163,3 +228,44 @@ func TestParseSSEChunkForContentStripsLeakedThinkTagsFromText(t *testing.T) {
 		t.Fatalf("expected leaked think tag to be stripped, got %#v", parts[0])
 	}
 }
+
+func TestParseSSEChunkForContentResponseContentObjectShape(t *testing.T) {
+	chunk := map[string]any{
+		"p": "response/content",
+		"v": map[string]any{"text": "对象内容"},
+	}
+	parts, finished, _ := ParseSSEChunkForContent(chunk, false, "text")
+	if finished {
+		t.Fatal("expected unfinished")
+	}
+	if len(parts) != 1 || parts[0].Text != "对象内容" || parts[0].Type != "text" {
+		t.Fatalf("unexpected parts: %#v", parts)
+	}
+}
+
+func TestParseSSEChunkForThinkingContentObjectShape(t *testing.T) {
+	chunk := map[string]any{
+		"p": "response/thinking_content",
+		"v": map[string]any{"content": "对象思考"},
+	}
+	parts, finished, _ := ParseSSEChunkForContent(chunk, true, "thinking")
+	if finished {
+		t.Fatal("expected unfinished")
+	}
+	if len(parts) != 1 || parts[0].Text != "对象思考" || parts[0].Type != "thinking" {
+		t.Fatalf("unexpected parts: %#v", parts)
+	}
+}
+
+func TestParseSSEChunkForContentObjectShapeWithoutPath(t *testing.T) {
+	chunk := map[string]any{
+		"v": map[string]any{"text": "无路径对象内容"},
+	}
+	parts, finished, _ := ParseSSEChunkForContent(chunk, false, "text")
+	if finished {
+		t.Fatal("expected unfinished")
+	}
+	if len(parts) != 1 || parts[0].Text != "无路径对象内容" || parts[0].Type != "text" {
+		t.Fatalf("unexpected parts: %#v", parts)
+	}
+}
--- a/internal/sse/stream.go
+++ b/internal/sse/stream.go
@@ -4,12 +4,14 @@ import (
 	"bufio"
 	"context"
 	"io"
+	"time"
 )

 const (
 	parsedLineBufferSize = 128
-	scannerBufferSize    = 64 * 1024
-	maxScannerLineSize   = 2 * 1024 * 1024
+	lineReaderBufferSize = 64 * 1024
+	minFlushChars        = 160
+	maxFlushWait         = 80 * time.Millisecond
 )

 // StartParsedLinePump scans an upstream DeepSeek SSE body and emits normalized
@@ -20,21 +22,131 @@ func StartParsedLinePump(ctx context.Context, body io.Reader, thinkingEnabled bo
 	done := make(chan error, 1)
 	go func() {
 		defer close(out)
-		scanner := bufio.NewScanner(body)
-		scanner.Buffer(make([]byte, 0, scannerBufferSize), maxScannerLineSize)
+		type scanItem struct {
+			line []byte
+			err  error
+			eof  bool
+		}
+		lineCh := make(chan scanItem, 1)
+		stopReader := make(chan struct{})
+		defer close(stopReader)
+		go func() {
+			sendScanItem := func(item scanItem) bool {
+				select {
+				case lineCh <- item:
+					return true
+				case <-ctx.Done():
+					return false
+				case <-stopReader:
+					return false
+				}
+			}
+			defer close(lineCh)
+			reader := bufio.NewReaderSize(body, lineReaderBufferSize)
+			for {
+				line, err := reader.ReadBytes('\n')
+				if len(line) > 0 {
+					line = append([]byte{}, line...)
+					if !sendScanItem(scanItem{line: line}) {
+						return
+					}
+				}
+				if err != nil {
+					if err == io.EOF {
+						err = nil
+					}
+					_ = sendScanItem(scanItem{err: err, eof: true})
+					return
+				}
+			}
+		}()
+
+		ticker := time.NewTicker(maxFlushWait)
+		defer ticker.Stop()
 		currentType := initialType
-		for scanner.Scan() {
-			line := append([]byte{}, scanner.Bytes()...)
-			result := ParseDeepSeekContentLine(line, thinkingEnabled, currentType)
-			currentType = result.NextType
+		var pending *LineResult
+		pendingChars := 0
+
+		sendResult := func(r LineResult) bool {
+			select {
+			case out <- r:
+				return true
+			case <-ctx.Done():
+				done <- ctx.Err()
+				return false
+			}
+		}
+
+		flushPending := func() bool {
+			if pending == nil {
+				return true
+			}
+			if !sendResult(*pending) {
+				return false
+			}
+			pending = nil
+			pendingChars = 0
+			return true
+		}
+
+		for {
 			select {
-			case out <- result:
 			case <-ctx.Done():
 				done <- ctx.Err()
 				return
+			case <-ticker.C:
+				if !flushPending() {
+					return
+				}
+			case item, ok := <-lineCh:
+				if !ok || item.eof {
+					if !flushPending() {
+						return
+					}
+					done <- item.err
+					return
+				}
+				line := item.line
+				result := ParseDeepSeekContentLine(line, thinkingEnabled, currentType)
+				currentType = result.NextType
+
+				canAccumulate := result.Parsed && !result.Stop && result.ErrorMessage == "" && !result.ContentFilter && result.ResponseMessageID == 0
+				if canAccumulate {
+					lineChars := 0
+					for _, p := range result.Parts {
+						lineChars += len(p.Text)
+					}
+					for _, p := range result.ToolDetectionThinkingParts {
+						lineChars += len(p.Text)
+					}
+					if lineChars > 0 {
+						if pending == nil {
+							cp := result
+							pending = &cp
+						} else {
+							pending.Parts = append(pending.Parts, result.Parts...)
+							pending.ToolDetectionThinkingParts = append(pending.ToolDetectionThinkingParts, result.ToolDetectionThinkingParts...)
+							pending.NextType = result.NextType
+						}
+						pendingChars += lineChars
+						if pendingChars < minFlushChars {
+							continue
+						}
+						if !flushPending() {
+							return
+						}
+						continue
+					}
+				}
+
+				if !flushPending() {
+					return
+				}
+				if !sendResult(result) {
+					return
+				}
 			}
 		}
-		done <- scanner.Err()
 	}()
 	return out, done
 }
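The accumulation logic above batches small deltas until a size threshold is reached. The sketch below shows only that size-based part, on plain strings; `coalesce` is a hypothetical helper, and the time-based flush (`maxFlushWait`) and the non-accumulable result types are deliberately left out.

```go
package main

import (
	"fmt"
	"strings"
)

// coalesce joins consecutive chunks until at least minChars bytes are
// pending, then emits them as one batch. Any remainder is flushed at the end.
func coalesce(chunks []string, minChars int) []string {
	var out []string
	var pending strings.Builder
	for _, c := range chunks {
		pending.WriteString(c)
		if pending.Len() >= minChars {
			out = append(out, pending.String())
			pending.Reset()
		}
	}
	if pending.Len() > 0 {
		out = append(out, pending.String())
	}
	return out
}

func main() {
	// With a threshold of 160, two one-byte deltas leave as a single batch.
	fmt.Printf("%q\n", coalesce([]string{"h", "i"}, 160))
	// With a threshold of 3, a batch is emitted as soon as 3 bytes are pending.
	fmt.Printf("%q\n", coalesce([]string{"abc", "de", "f"}, 3))
}
```

As in the diff, the threshold here is measured with `len`, that is, in bytes rather than Unicode characters, so multi-byte text reaches it sooner than the name `minFlushChars` might suggest.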
--- a/internal/sse/stream_edge_test.go
+++ b/internal/sse/stream_edge_test.go
@@ -38,8 +38,8 @@ func TestStartParsedLinePumpMultipleLines(t *testing.T) {
 	if err := <-done; err != nil {
 		t.Fatalf("unexpected error: %v", err)
 	}
-	if len(collected) < 3 {
-		t.Fatalf("expected at least 3 results, got %d", len(collected))
+	if len(collected) < 2 {
+		t.Fatalf("expected at least 2 results, got %d", len(collected))
 	}
 	// First should be thinking
 	if collected[0].Parts[0].Type != "thinking" {
@@ -158,11 +158,13 @@ func TestStartParsedLinePumpNonSSELines(t *testing.T) {

 func TestStartParsedLinePumpThinkingDisabled(t *testing.T) {
 	body := strings.NewReader(
-		"data: {\"p\":\"response/thinking_content\",\"v\":\"thought\"}\n" +
+		"data: {\"p\":\"response/fragments\",\"o\":\"APPEND\",\"v\":[{\"type\":\"THINK\",\"content\":\"思\"}]}\n" +
+			"data: {\"p\":\"response/fragments/-1/content\",\"v\":\"考\"}\n" +
+			"data: {\"v\":\"隐藏\"}\n" +
+			"data: {\"p\":\"response/fragments\",\"o\":\"APPEND\",\"v\":[{\"type\":\"RESPONSE\",\"content\":\"答\"}]}\n" +
 			"data: {\"p\":\"response/content\",\"v\":\"response\"}\n" +
 			"data: [DONE]\n",
 	)
-	// With thinking disabled, thinking content should still be emitted but marked differently
 	results, done := StartParsedLinePump(context.Background(), body, false, "text")

 	var parts []ContentPart
@@ -171,7 +173,42 @@ func TestStartParsedLinePumpThinkingDisabled(t *testing.T) {
 	}
 	<-done

-	if len(parts) < 1 {
-		t.Fatalf("expected at least 1 part, got %d", len(parts))
+	got := strings.Builder{}
+	for _, p := range parts {
+		if p.Type != "text" {
+			t.Fatalf("expected only text parts with thinking disabled, got %#v", parts)
+		}
+		got.WriteString(p.Text)
+	}
+	if got.String() != "答response" {
+		t.Fatalf("expected hidden thinking to be dropped, got %q from %#v", got.String(), parts)
+	}
+}
+
+func TestStartParsedLinePumpAccumulatesSmallChunks(t *testing.T) {
+	body := strings.NewReader(
+		"data: {\"p\":\"response/content\",\"v\":\"h\"}\n" +
+			"data: {\"p\":\"response/content\",\"v\":\"i\"}\n" +
+			"data: [DONE]\n",
+	)
+
+	results, done := StartParsedLinePump(context.Background(), body, false, "text")
+
+	collected := make([]LineResult, 0)
+	for r := range results {
+		collected = append(collected, r)
+	}
+	if err := <-done; err != nil {
+		t.Fatalf("unexpected error: %v", err)
+	}
+
+	if len(collected) != 2 {
+		t.Fatalf("expected 2 results (accumulated content + done), got %d", len(collected))
+	}
+	if len(collected[0].Parts) != 2 {
+		t.Fatalf("expected 2 accumulated parts, got %d", len(collected[0].Parts))
+	}
+	if !collected[1].Stop {
+		t.Fatal("expected second result to stop")
 	}
 }
--- a/internal/sse/stream_test.go
+++ b/internal/sse/stream_test.go
@@ -2,10 +2,23 @@ package sse

 import (
 	"context"
+	"encoding/json"
 	"strings"
 	"testing"
 )

+func makeLargeContentSSEBody(t *testing.T, payload string) string {
+	t.Helper()
+	line, err := json.Marshal(map[string]any{
+		"p": "response/content",
+		"v": payload,
+	})
+	if err != nil {
+		t.Fatalf("marshal SSE line failed: %v", err)
+	}
+	return "data: " + string(line) + "\n" + "data: [DONE]\n"
+}
+
 func TestStartParsedLinePumpParsesAndStops(t *testing.T) {
 	body := strings.NewReader("data: {\"p\":\"response/content\",\"v\":\"hi\"}\n\ndata: [DONE]\n")
 	results, done := StartParsedLinePump(context.Background(), body, false, "text")
@@ -28,3 +41,28 @@ func TestStartParsedLinePumpParsesAndStops(t *testing.T) {
 		t.Fatalf("expected last line to stop stream, got parsed=%v stop=%v", last.Parsed, last.Stop)
 	}
 }
+
+func TestStartParsedLinePumpHandlesLongSingleSSELine(t *testing.T) {
+	payload := strings.Repeat("x", 2*1024*1024+4096)
+	results, done := StartParsedLinePump(context.Background(), strings.NewReader(makeLargeContentSSEBody(t, payload)), false, "text")
+
+	var got strings.Builder
+	var sawDone bool
+	for r := range results {
+		for _, p := range r.Parts {
+			got.WriteString(p.Text)
+		}
+		if r.Stop {
+			sawDone = true
+		}
+	}
+	if err := <-done; err != nil {
+		t.Fatalf("unexpected long-line read error: %v", err)
+	}
+	if got.String() != payload {
+		t.Fatalf("long SSE line payload mismatch: got len=%d want len=%d", got.Len(), len(payload))
+	}
+	if !sawDone {
+		t.Fatal("expected DONE after long SSE line")
+	}
+}
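The long-line tests above exercise the switch from `bufio.Scanner` to `bufio.Reader.ReadBytes`. A minimal sketch of the underlying difference, using a default-configured `Scanner` (64 KiB token limit) rather than the 2 MiB limit the old code configured; `longInput`, `scanLines`, and `readLines` are illustrative helpers.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// longInput builds one line of n bytes followed by a short second line.
func longInput(n int) string {
	return strings.Repeat("x", n) + "\nend\n"
}

// scanLines counts lines with bufio.Scanner, which fails with
// bufio.ErrTooLong once a single line exceeds its maximum token size.
func scanLines(input string) (int, error) {
	sc := bufio.NewScanner(strings.NewReader(input))
	n := 0
	for sc.Scan() {
		n++
	}
	return n, sc.Err()
}

// readLines reads lines with ReadBytes, which grows its result as needed
// and therefore has no fixed per-line limit.
func readLines(input string) ([]int, error) {
	r := bufio.NewReaderSize(strings.NewReader(input), 4096)
	var lens []int
	for {
		line, err := r.ReadBytes('\n')
		if len(line) > 0 {
			lens = append(lens, len(line))
		}
		if err == io.EOF {
			return lens, nil
		}
		if err != nil {
			return lens, err
		}
	}
}

func main() {
	in := longInput(128 * 1024)
	n, err := scanLines(in)
	fmt.Println("scanner lines:", n, "err:", err)
	lens, err := readLines(in)
	fmt.Println("reader line lengths:", lens, "err:", err)
}
```

The trade-off is that `ReadBytes` will buffer an arbitrarily long line in memory, so any size cap has to be enforced elsewhere if one is needed.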
--- a/internal/toolcall/toolcalls_dsml.go
+++ b/internal/toolcall/toolcalls_dsml.go
@@ -44,6 +44,9 @@ func rewriteDSMLToolMarkupOutsideIgnored(text string) string {
 			}
 			b.WriteString(tag.Name)
 			b.WriteString(text[tag.NameEnd : tag.End+1])
+			if text[tag.End] != '>' {
+				b.WriteByte('>')
+			}
 			i = tag.End + 1
 			continue
 		}
--- a/internal/toolcall/toolcalls_parse.go
+++ b/internal/toolcall/toolcalls_parse.go
@@ -92,11 +92,45 @@ func filterToolCallsDetailed(parsed []ParsedToolCall) ([]ParsedToolCall, []strin
 		if tc.Input == nil {
 			tc.Input = map[string]any{}
 		}
+		if len(tc.Input) > 0 && !toolCallInputHasMeaningfulValue(tc.Input) {
+			continue
+		}
 		out = append(out, tc)
 	}
 	return out, nil
 }

+func toolCallInputHasMeaningfulValue(v any) bool {
+	switch x := v.(type) {
+	case nil:
+		return false
+	case string:
+		return strings.TrimSpace(x) != ""
+	case map[string]any:
+		if len(x) == 0 {
+			return false
+		}
+		for _, child := range x {
+			if toolCallInputHasMeaningfulValue(child) {
+				return true
+			}
+		}
+		return false
+	case []any:
+		if len(x) == 0 {
+			return false
+		}
+		for _, child := range x {
+			if toolCallInputHasMeaningfulValue(child) {
+				return true
+			}
+		}
+		return false
+	default:
+		return true
+	}
+}
+
 func looksLikeToolCallSyntax(text string) bool {
 	hasDSML, hasCanonical := ContainsToolCallWrapperSyntaxOutsideIgnored(text)
 	return hasDSML || hasCanonical
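The new filter above drops tool calls whose non-empty input contains nothing but blank values. The sketch below restates that recursive check and runs it against decoded JSON; `hasMeaningfulValue` and `meaningfulJSON` are illustrative names, and JSON decoding is an assumption made here only to produce sample inputs.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// hasMeaningfulValue reports whether v holds any non-blank leaf: a
// non-whitespace string, or any number or boolean, possibly nested.
func hasMeaningfulValue(v any) bool {
	switch x := v.(type) {
	case nil:
		return false
	case string:
		return strings.TrimSpace(x) != ""
	case map[string]any:
		for _, child := range x {
			if hasMeaningfulValue(child) {
				return true
			}
		}
		return false
	case []any:
		for _, child := range x {
			if hasMeaningfulValue(child) {
				return true
			}
		}
		return false
	default:
		// Numbers and booleans count as meaningful, including 0 and false.
		return true
	}
}

// meaningfulJSON decodes raw JSON and applies the check; invalid JSON is
// treated as not meaningful in this sketch.
func meaningfulJSON(raw string) bool {
	var v any
	if err := json.Unmarshal([]byte(raw), &v); err != nil {
		return false
	}
	return hasMeaningfulValue(v)
}

func main() {
	for _, raw := range []string{
		`{"path":"  ","opts":{"lines":[]}}`,
		`{"path":"a.txt"}`,
		`{"count":0}`,
		`{"flag":false,"note":null}`,
	} {
		fmt.Printf("%s -> %v\n", raw, meaningfulJSON(raw))
	}
}
```

Note that in the diff the check is applied only when `len(tc.Input) > 0`, so a tool call with a completely empty input map is still kept.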
Author	SHA1	Message	Date
CJACK.	445c95a4f2	Merge pull request #379 from CJackHwang/dev Merge pull request #377 from CJackHwang/codex/run-all-tests-and-fix-failures Fix failing current-input token accounting test	2026-05-01 16:12:17 +08:00
CJACK	0a6ef8e3f2	fix: remove bufio.Scanner 2MiB line limit for SSE; support quasi_status direct patch Replace bufio.Scanner with bufio.NewReaderSize + ReadBytes('\n') across all SSE read paths to preserve long single-line data (e.g. write_file content). Add quasi_status and auto_continue handling as direct path-based patches in both Go continue observer and Node vercel_stream_impl, mirroring existing batch-patch logic. Add 2MiB+ line throughput tests at every SSE layer. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-01 15:45:17 +08:00
CJACK	fd0ec29991	refactor: generalize DSML tag parsing to tolerate model noise; split tiktoken by build tags Replace hardcoded DSML typo variant lists in Go/Node tool call parsers with generalized prefix consumption that tolerates repeated leading <, repeated DSML prefix noise, and trailing pipe terminators. Split tiktoken-dependent token counting into a build-tagged file for non-cgo platform compatibility. Add /data directory to Dockerfile for bind-mount permissions. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-01 15:17:11 +08:00
CJACK	2671298439	fix: coalesce small stream deltas to prevent character swallowing; add read-tool cache guard Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-01 13:53:27 +08:00
CJACK	92e321fe2c	Fix the character-swallowing issue	2026-05-01 01:31:48 +08:00
CJACK.	fca8c01397	Merge pull request #385 from ouqiting/fix_chat_history fix: content being overwritten and left empty	2026-05-01 00:21:33 +08:00
ouqiting	667e1e3710	fix: content being overwritten and left empty	2026-04-30 21:36:47 +08:00
CJACK.	4438d03c5c	Merge pull request #377 from CJackHwang/codex/run-all-tests-and-fix-failures Fix failing current-input token accounting test	2026-04-30 02:41:08 +08:00
CJACK.	95b7665643	Merge branch 'dev' into codex/run-all-tests-and-fix-failures	2026-04-30 02:39:18 +08:00
CJACK.	9896b1fc33	Merge pull request #378 from NgoQuocViet2001/ai/openai-root-route-aliases feat(openai): add root route aliases	2026-04-30 02:33:38 +08:00
CJACK.	966f21211d	Fix nil-session guard in chat history test	2026-04-30 02:31:06 +08:00
NgoQuocViet2001	7dc3af40b2	feat(openai): add root route aliases	2026-04-30 01:24:53 +07:00
CJACK.	2f6b5ffda0	Fix current-input token text test expectation	2026-04-30 02:22:17 +08:00
CJACK.	85e256ad4d	Merge pull request #375 from CJackHwang/codex/investigate-data-loss-issue-in-pr-369 sse/parser: treat object-shaped `v` as visible content, preserve INCOMPLETE across omitted status; add tests and samples	2026-04-30 02:14:26 +08:00
CJACK.	7c3ff6ee7e	Merge pull request #374 from shern-point/feat/full-context-file-token-accounting Feat/full context file token accounting	2026-04-30 02:12:55 +08:00
CJACK.	63e62fd1b0	Merge pull request #372 from shern-point/feat/accurate-context-token-length Feat/accurate context token length	2026-04-30 02:11:32 +08:00
CJACK.	483d7af3d2	Merge pull request #373 from NgoQuocViet2001/ai/ds2api-small-regression-fix fix(openai): return 400 for inline file limit	2026-04-30 02:08:22 +08:00
CJACK.	0f89823526	chore(sse): bump client version and refresh longtext stream fixtures	2026-04-30 02:05:45 +08:00
shern-point	6a778e0d35	feat: include inline-uploaded file tokens in context token accounting Track byte sizes of inline-uploaded files during PreprocessInlineFileInputs and convert them to conservative token estimates (bytes/3). RefFileTokens is threaded through StandardRequest into all OpenAI chat/responses usage builders so returned prompt_tokens/input_tokens reflect the full upstream context cost including attached files.	2026-04-30 01:42:51 +08:00
NgoQuocViet2001	9035c350a7	fix(openai): return 400 for inline file limit	2026-04-30 00:35:59 +07:00
shern-point	ba80052a26	fix: count uploaded file content in context token accounting PromptTokenText now reflects the actual downstream context cost: the uploaded IGNORE.txt file content plus the neutral live prompt, instead of only the pre-split prompt text.	2026-04-30 01:12:35 +08:00
shern-point	78fdd63470	feat: add full-context token regression coverage and docs Lock in the current_input_file regression with API-level tests and document that returned context token counts now track full prompt semantics with conservative sizing.	2026-04-30 00:46:06 +08:00
shern-point	4b4f097006	feat: use model-aware prompt counting in Gemini paths Preserve Gemini prompt token text during normalization and remove the hardcoded DeepSeek model from native Gemini usage helpers.	2026-04-30 00:46:05 +08:00
shern-point	d3018c281b	feat: use tokenizer-based counting in Claude token paths Unify Claude count_tokens, legacy stream accounting, and legacy render usage with preserved prompt text so Claude stops falling back to lossy message formatting.	2026-04-30 00:46:04 +08:00
shern-point	415a2359ad	feat: route OpenAI responses usage through preserved prompt text Use the stored full-context prompt text for responses accounting so neutral placeholder prompts do not underreport returned input token counts.	2026-04-30 00:45:31 +08:00
shern-point	f702d45a24	feat: route OpenAI chat usage through preserved prompt text Use the stored full-context prompt text for chat non-stream, stream, and retry accounting so current_input_file no longer shrinks returned prompt token counts.	2026-04-30 00:45:30 +08:00
shern-point	90817cb9e2	feat: apply tokenizer-based counting in OpenAI usage builders Move OpenAI chat and responses usage accounting onto the shared tokenizer-aware counters so prompt and output usage stay model-aware and conservatively sized.	2026-04-30 00:45:29 +08:00
shern-point	b96f736bd2	feat: preserve full prompt text across current_input_file rewrites Keep token accounting tied to the original prompt even after the live prompt is replaced with a neutral placeholder and hidden context file.	2026-04-30 00:45:01 +08:00
shern-point	8ab028c52a	feat: seed PromptTokenText during request normalization Capture the fully built prompt at normalization time for OpenAI and Gemini-compatible requests so usage paths can reuse the original context text.	2026-04-30 00:44:59 +08:00
shern-point	78366afec5	feat: add PromptTokenText to StandardRequest Track a dedicated prompt string for token accounting so later prompt rewrites can keep returning full-context counts.	2026-04-30 00:44:57 +08:00
shern-point	bd41c8a90c	feat: add tokenizer-based token counting utilities Use go-tiktoken with embedded vocabularies for accurate BPE token counting. CountPromptTokens applies conservative padding so returned context token counts stay slightly above the real value instead of undercounting.	2026-04-30 00:44:11 +08:00
CJACK.	bc2a78ae29	Merge pull request #370 from CJackHwang/codex/align-vercel-behavior-with-go fix(vercel): align JS stream parser with Go object-shaped content	2026-04-30 00:12:32 +08:00
CJACK.	192cdf8562	fix(vercel): align JS stream parser with Go object-shaped content	2026-04-29 23:56:16 +08:00
CJACK.	94c1acace5	Merge pull request #369 from CJackHwang/dev Merge pull request #368 from CJackHwang/codex/fix-review-issues-for-pr-#364 Restore thinking fallback for tool-call detection and drop history.txt wrapper tags	2026-04-29 23:42:50 +08:00
CJACK.	273c18ba0f	fix: fallback to /app config when /data is unavailable	2026-04-29 23:40:07 +08:00
CJACK.	ae28e33184	fix: preserve continue state when chunk status is missing	2026-04-29 23:25:18 +08:00
CJACK.	0438ce9a12	Merge pull request #368 from CJackHwang/codex/fix-review-issues-for-pr-#364 Restore thinking fallback for tool-call detection and drop history.txt wrapper tags	2026-04-29 23:07:36 +08:00
CJACK.	af4a067dab	Merge pull request #362 from CJackHwang/codex/fix-issue-based-on-feedback fix(sse): batch tiny stream chunks before emitting	2026-04-29 23:07:03 +08:00
CJACK.	33f6fef015	Fix tool-call fallback on sanitized empty text and remove history wrapper tags	2026-04-29 23:04:45 +08:00
CJACK.	6d3979a1d6	fix(sse): stop scanner sender when stream context cancels	2026-04-29 22:59:22 +08:00
CJACK.	c8922c7a88	Merge pull request #364 from adnxx1wsx/dev Fix stream compatibility and vision model exposure	2026-04-29 22:02:19 +08:00
MiY	241334c658	Fix stream compatibility and vision model exposure	2026-04-29 20:23:13 +08:00
CJACK.	d7e071b24a	Bump version from 4.1.3 to 4.2.0	2026-04-29 19:08:57 +08:00
CJACK.	89225c778e	fix(sse): batch tiny stream chunks before emitting	2026-04-29 18:58:54 +08:00
CJACK.	22160de2c4	Merge pull request #359 from NgoQuocViet2001/ai/ds2api-small-fix fix(openai): keep citation indexes one-based with zero-based references	2026-04-29 18:27:15 +08:00
NgoQuocViet2001	0cbc2c875d	fix(openai): keep citation indexes one-based	2026-04-29 15:43:09 +07:00
CJACK.	a0984ef682	Merge pull request #358 from CJackHwang/revert-356-codex/check-version-update-in-automation-scripts Revert "Verify GHCR latest tag matches release and show version source/latest in dashboard"	2026-04-29 14:49:41 +08:00
CJACK.	babfa973d6	Revert "Verify GHCR latest tag matches release and show version source/latest in dashboard"	2026-04-29 14:47:53 +08:00
CJACK.	ba4071d8b5	Merge pull request #357 from CJackHwang/codex/update-documentation-for-config.json-permissions Return config persistence warning when config path is read-only; default container config to /data/config.json and update docs	2026-04-29 14:18:25 +08:00
CJACK.	e1f8e493d2	fix: add legacy /app/config.json fallback for container upgrades	2026-04-29 14:12:20 +08:00
CJACK.	907104a735	Merge pull request #356 from CJackHwang/codex/check-version-update-in-automation-scripts Verify GHCR latest tag matches release and show version source/latest in dashboard	2026-04-29 13:53:42 +08:00
CJACK.	2c8409dcbb	fix docker defaults to writable /data config path and align docs	2026-04-29 13:46:22 +08:00
CJACK.	5c23261932	webui: show version source and latest release tag in sidebar	2026-04-29 13:45:33 +08:00
CJACK.	d7125ea106	Bump version from 4.1.2 to 4.1.3	2026-04-29 07:55:48 +08:00
CJACK.	929d9a8ef7	Merge pull request #352 from shern-point/fix/tool-string-schema-protection Fix/tool type schema protection	2026-04-29 07:51:21 +08:00
CJACK.	c03f733b83	Merge pull request #353 from Gingiris/docs/add-toc docs: add Table of Contents to README.MD and README.en.md	2026-04-29 07:50:54 +08:00
Gingiris	047fc9bee2	docs: add Table of Contents to README.MD and README.en.md Both READMEs are 400+ lines with 14 top-level sections and multiple subsections but have no navigation aid. Add a Table of Contents at the top of each file to help readers quickly find relevant sections. Changes: - README.MD: add 目录 section with links to all h2/h3 headings - README.en.md: add Table of Contents with matching structure	2026-04-28 12:18:37 -07:00
shern-point	52558838ef	docs: document request-scoped tool schema authority	2026-04-29 02:00:20 +08:00
shern-point	f1926a6ced	fix: normalize Vercel stream tool arguments by schema	2026-04-29 02:00:01 +08:00
shern-point	6e21714e23	test: cover Claude schema-aware tool normalization	2026-04-29 01:59:42 +08:00
shern-point	48c4f0df9f	fix: preserve runtime tool schemas in Claude tool output	2026-04-29 01:59:24 +08:00
shern-point	a550de30af	fix: expand shared tool schema extraction	2026-04-29 01:59:05 +08:00
CJACK.	23422e4a8e	Merge pull request #350 from ouqiting/fix_chat_histroy feat: parse split context files in list view	2026-04-29 01:34:10 +08:00
CJACK.	9c33bed403	Merge pull request #349 from RinZ27/fix-docker-non-root build: improve Docker robustness and fix potential security issues	2026-04-29 01:34:00 +08:00
ouqiting	c81294f1b7	fix(chat-history): support tool turns in parsed HISTORY list view	2026-04-29 01:27:14 +08:00
ouqiting	28d2b0410f	feat: parse split context files in list view	2026-04-29 01:15:29 +08:00
RinZ27	0c782407f5	build: improve Docker robustness and fix potential security issues	2026-04-28 23:49:54 +07:00
@@ -1 +1 @@
 .1.2
 .2.1