Merge pull request #418 from lwz762/fix/admin-css-mime-windows

fix(webui): fix /admin styles breaking due to incorrect MIME types in the Windows registry
fix(webui): pin Content-Type for /admin static assets
2026-05-04 16:35:27 +08:00 · 2026-05-04 14:31:47 +08:00 · 2026-05-04 10:09:50 +08:00 · 2026-05-03 20:48:53 +08:00 · 2026-05-03 20:46:22 +08:00 · 2026-05-03 20:45:31 +08:00
214 changed files with 20421 additions and 6868 deletions
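Note on the fix named in the title: Go's `mime.TypeByExtension` consults the Windows registry, so a machine with a wrong `.css` entry can cause static files to be served with a type the browser rejects as a stylesheet. The sketch below shows one way to pin the header, assuming a wrapper around the static handler. `pinnedTypes`, `pinnedType`, and `pinContentType` are hypothetical names, and the extension table is illustrative rather than the list this PR actually uses. It relies on the documented behaviour that `net/http` file serving only guesses a type when `Content-Type` is not already set.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"path"
	"strings"
)

// pinnedTypes is a hypothetical fixed table of extension -> Content-Type.
// It deliberately bypasses mime.TypeByExtension and any OS-level lookup.
var pinnedTypes = map[string]string{
	".css":  "text/css; charset=utf-8",
	".js":   "text/javascript; charset=utf-8",
	".html": "text/html; charset=utf-8",
	".json": "application/json",
	".svg":  "image/svg+xml",
}

// pinnedType returns the pinned Content-Type for a URL path, or "" when the
// extension is not in the table.
func pinnedType(urlPath string) string {
	return pinnedTypes[strings.ToLower(path.Ext(urlPath))]
}

// pinContentType sets Content-Type before delegating to next, so the
// downstream handler does not need to guess the type.
func pinContentType(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if ct := pinnedType(r.URL.Path); ct != "" {
			w.Header().Set("Content-Type", ct)
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	// Stand-in for a static file handler that sets no Content-Type itself.
	assets := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		_, _ = w.Write([]byte("body{}"))
	})

	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/admin/assets/index.css", nil)
	pinContentType(assets).ServeHTTP(rec, req)

	fmt.Println(rec.Header().Get("Content-Type")) // prints: text/css; charset=utf-8
}
```

In a real server the stand-in handler would be something like `http.FileServer` over `static/admin`; the wrapper itself would not need to change.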
--- a/.github/workflows/release-artifacts.yml
+++ b/.github/workflows/release-artifacts.yml
@@ -47,7 +47,6 @@ jobs:
      - name: Release Blocking Gates
        run: |
          ./tests/scripts/check-stage6-manual-smoke.sh
          ./tests/scripts/check-refactor-line-gate.sh
          ./tests/scripts/run-unit-all.sh
--- a/.gitignore
+++ b/.gitignore
@@ -29,6 +29,7 @@ yarn.lock
 pnpm-lock.yaml
 # Build artifacts
 dist/
 *.tsbuildinfo
 .cache/
 .parcel-cache/
--- a/API.en.md
+++ b/API.en.md
@@ -33,6 +33,8 @@ Docs: [Overview](README.en.md) / [Architecture](docs/ARCHITECTURE.en.md) / [Depl
 | Health probes | `GET /healthz`, `GET /readyz` |
 | CORS | Enabled (uniformly covers `/v1/*`, `/anthropic/*`, `/v1beta/models/*`, and `/admin/*`; echoes the browser `Origin` when present, otherwise `*`; default allow-list includes `Content-Type`, `Authorization`, `X-API-Key`, `X-Ds2-Target-Account`, `X-Ds2-Source`, `X-Vercel-Protection-Bypass`, `X-Goog-Api-Key`, `Anthropic-Version`, `Anthropic-Beta`, and also accepts third-party preflight-requested headers such as `x-stainless-*`; `/v1/chat/completions` on Vercel Node Runtime matches the same behavior; internal-only `X-Ds2-Internal-Token` remains blocked) |
 - All JSON request bodies must be valid UTF-8; malformed byte sequences are rejected on ingress with `400 invalid json`.
 ### 3.0 Adapter-Layer Notes
 - OpenAI / Claude / Gemini protocols are now mounted on one shared `chi` router tree assembled in `internal/server/router.go`.
@@ -81,7 +83,7 @@ Two header formats accepted:
 - Token is in `config.keys` → **Managed account mode**: DS2API auto-selects an account via rotation
 - Token is not in `config.keys` → **Direct token mode**: treated as a DeepSeek token directly
-**Optional header**: `X-Ds2-Target-Account: <email_or_mobile>` — Pin a specific managed account.
+**Optional header**: `X-Ds2-Target-Account: <email_or_mobile>` — Pin a specific managed account; if the target account does not exist or the managed-account queue is exhausted, the request returns `429`, and current responses do not include `Retry-After`. If the account exists but login/refresh fails, the request returns the underlying `401` or upstream error.
 Gemini-compatible clients can also send `x-goog-api-key`, `?key=`, or `?api_key=` as the caller credential source.
 ### Admin Endpoints (`/admin/*`)
@@ -109,6 +111,7 @@ Gemini-compatible clients can also send `x-goog-api-key`, `?key=`, or `?api_key=
 | GET | `/v1/responses/{response_id}` | Business | Query stored response (in-memory TTL) |
 | POST | `/v1/embeddings` | Business | OpenAI Embeddings API |
 | POST | `/v1/files` | Business | OpenAI Files upload (multipart/form-data) |
 | GET | `/v1/files/{file_id}` | Business | Retrieve uploaded file status |
 | GET | `/anthropic/v1/models` | None | Claude model list |
 | POST | `/anthropic/v1/messages` | Business | Claude messages |
 | POST | `/anthropic/v1/messages/count_tokens` | Business | Claude token counting |
@@ -165,6 +168,8 @@ Gemini-compatible clients can also send `x-goog-api-key`, `?key=`, or `?api_key=
 | PUT | `/admin/chat-history/settings` | Admin | Update conversation history retention limit |
 | GET | `/admin/version` | Admin | Check current version and latest Release |
 OpenAI `/v1/*` paths are canonical. For clients configured with the bare DS2API service URL, the same OpenAI handlers are also exposed through root shortcuts: `/models`, `/models/{id}`, `/chat/completions`, `/responses`, `/responses/{response_id}`, `/embeddings`, `/files`, and `/files/{file_id}`.
 ---
 ## Health Endpoints
@@ -196,11 +201,15 @@ No auth required. Returns the currently supported DeepSeek native model list.
  "object": "list",
  "data": [
    {"id": "deepseek-v4-flash", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
    {"id": "deepseek-v4-flash-nothinking", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
    {"id": "deepseek-v4-pro", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
    {"id": "deepseek-v4-pro-nothinking", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
    {"id": "deepseek-v4-flash-search", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
    {"id": "deepseek-v4-flash-search-nothinking", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
    {"id": "deepseek-v4-pro-search", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
    {"id": "deepseek-v4-pro-search-nothinking", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
    {"id": "deepseek-v4-vision", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
-    {"id": "deepseek-v4-vision-search", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []}
+    {"id": "deepseek-v4-vision-nothinking", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []}
  ]
 }
 ```
@@ -224,6 +233,8 @@ Built-in aliases come from `internal/config/models.go`; `config.model_aliases` c
 - Gemini: `gemini-2.5-pro`, `gemini-2.5-flash`, `gemini-pro-vision`
 - Other compatibility families: `llama-*`, `qwen-*`, `mistral-*`, and `command-*` fall back through family heuristics
 Current vision support resolves only to `deepseek-v4-vision` and does not expose a separate `vision-search` variant.
 Retired historical families such as `claude-1.*`, `claude-2.*`, `claude-instant-*`, and `gpt-3.5*` are explicitly rejected.
 ### `POST /v1/chat/completions`
@@ -297,7 +308,7 @@ data: [DONE]
 - When thinking is enabled, the stream may emit `delta.reasoning_content`
 - Text emits `delta.content`
 - Last chunk includes `finish_reason` and `usage`
- Token counting prefers pass-through from upstream DeepSeek SSE (`accumulated_token_usage` / `token_usage`), and only falls back to local estimation when upstream usage is absent
+- Token counting prefers pass-through from upstream DeepSeek SSE (`accumulated_token_usage` / `token_usage`), and only falls back to local estimation when upstream usage is absent. Failed/interrupted endings (for example `response.failed`) may not include `usage`
 #### Tool Calls
@@ -413,7 +424,7 @@ Business auth required. Returns OpenAI-compatible embeddings shape.
 | `model` | string | ✅ | Supports native models + alias mapping |
 | `input` | string/array | ✅ | Supports string, string array, token array |
-> Requires `embeddings.provider`. Current supported values: `mock` / `deterministic` / `builtin`. If missing/unsupported, returns standard error shape with HTTP 501.
+> Requires `embeddings.provider`. Current supported values: `mock` / `deterministic` / `builtin` (all three use the same local deterministic implementation). If missing/unsupported, returns standard error shape with HTTP 501.
 ### `POST /v1/files`
@@ -427,9 +438,13 @@ Business auth required. OpenAI Files-compatible upload endpoint; currently only
 Constraints and behavior:
 - `Content-Type` must be `multipart/form-data` (otherwise `400`).
- Total request size limit is `100 MiB` (over-limit returns `413`).
+- Total request size limit is **100 MiB** (over-limit returns `413`).
 - Success returns an OpenAI `file` object (`id/object/bytes/filename/purpose/status`, etc.) and includes `account_id` for source-account tracing.
 ### `GET /v1/files/{file_id}`
 Business auth required. Retrieves the current DeepSeek upload status for a file and returns an OpenAI `file` object. Returns `404` when no matching file is found.
 ---
 ## Claude-Compatible API
@@ -481,6 +496,13 @@ anthropic-version: 2023-06-01
 | `stream` | boolean | ❌ | Default `false` |
 | `system` | string | ❌ | Optional system prompt |
 | `tools` | array | ❌ | Claude tool schema |
 | `thinking` | object | ❌ | Anthropic thinking config; translated into downstream reasoning control, and ignored by `-nothinking` models |
 | `temperature` | number | ❌ | Passed through to the downstream bridge; if `temperature` and `top_p` are both present, `temperature` wins |
 | `top_p` | number | ❌ | Passed through when `temperature` is absent |
 | `stop_sequences` | array | ❌ | Passed through as downstream stop sequences |
 | `tool_choice` | string/object | ❌ | Supports `auto` / `none` / `required` / `{"type":"function","name":"..."}` and is translated to downstream tool choice |
 > Note: `thinking`, `temperature`, `top_p`, `stop_sequences`, and `tool_choice` are translated through the compatibility bridge. Final behavior still depends on the selected model and upstream support. When both `temperature` and `top_p` are present, `temperature` takes precedence.
 #### Non-Stream Response
@@ -533,7 +555,7 @@ data: {"type":"message_stop"}
 **Notes**:
- Models whose names contain `opus` / `reasoner` / `slow` stream `thinking_delta`
+- Models that support thinking emit `thinking` blocks / `thinking_delta` by default; explicit thinking disablement or `-nothinking` models suppress them
 - `signature_delta` is not emitted (DeepSeek does not provide verifiable thinking signatures)
 - In `tools` mode, the stream avoids leaking raw tool JSON and does not force `input_json_delta`
@@ -579,6 +601,7 @@ Request body accepts Gemini-style `contents` / `tools`. Model names can use alia
 Response uses Gemini-compatible fields, including:
 - `candidates[].content.parts[].text`
 - `candidates[].content.parts[].thought=true` for thinking output
 - `candidates[].content.parts[].functionCall` (when tool call is produced)
 - `usageMetadata` (`promptTokenCount` / `candidatesTokenCount` / `totalTokenCount`)
@@ -587,6 +610,7 @@ Response uses Gemini-compatible fields, including:
 Returns SSE (`text/event-stream`), each chunk as `data: <json>`:
 - regular text: incremental text chunks
 - thinking: incremental chunks with `parts[].thought=true`
 - `tools` mode: buffered and emitted as `functionCall` at finalize phase
 - final chunk: includes `finishReason: "STOP"` and `usageMetadata`
 - Token counting prefers pass-through from upstream DeepSeek SSE (`accumulated_token_usage` / `token_usage`), and only falls back to local estimation when upstream usage is absent
@@ -709,7 +733,6 @@ Reads runtime settings and status, including:
 - `success`
 - `admin` (`has_password_hash`, `jwt_expire_hours`, `jwt_valid_after_unix`, `default_password_warning`)
 - `runtime` (`account_max_inflight`, `account_max_queue`, `global_max_inflight`, `token_refresh_interval_hours`)
 - `compat` (`wide_input_strict_output`, `strip_reference_markers`)
 - `responses` / `embeddings`
 - `auto_delete` (`mode`: `none` / `single` / `all`; legacy `sessions=true` is still treated as `all`)
 - `current_input_file` (`enabled` defaults to `true`, plus `min_chars`)
@@ -723,13 +746,11 @@ Hot-updates runtime settings. Supported fields:
 - `admin.jwt_expire_hours`
 - `runtime.account_max_inflight` / `runtime.account_max_queue` / `runtime.global_max_inflight` / `runtime.token_refresh_interval_hours`
 - `compat.wide_input_strict_output` / `compat.strip_reference_markers`
 - `responses.store_ttl_seconds`
 - `embeddings.provider`
 - `auto_delete.mode`
 - `current_input_file.enabled` / `current_input_file.min_chars`
 - `model_aliases`
 - `history_split` is retained only for legacy config compatibility and no longer affects requests
 - `toolcall` policy is fixed and is no longer writable through settings
 ### `POST /admin/settings/password`
@@ -753,9 +774,9 @@ Imports full config with:
 The request can send config directly, or wrapped as `{"config": {...}, "mode":"merge"}`.
 Query params `?mode=merge` / `?mode=replace` are also supported.
-`replace` mode replaces the full config shape while preserving Vercel sync metadata. `merge` mode merges `keys`, `api_keys`, `accounts`, and `model_aliases`, and overwrites non-empty fields under `admin`, `runtime`, `responses`, and `embeddings`. Manage `compat`, `auto_delete`, and `current_input_file` via `/admin/settings` or the config file; `history_split` remains only for legacy compatibility; legacy `toolcall` fields are ignored.
+`replace` mode replaces the full config shape while preserving Vercel sync metadata. `merge` mode merges `keys`, `api_keys`, `accounts`, and `model_aliases`, and overwrites non-empty fields under `admin`, `runtime`, `responses`, and `embeddings`. Manage `auto_delete` and `current_input_file` via `/admin/settings` or the config file; legacy `compat` and `toolcall` fields are ignored.
-> Note: `merge` mode does not update `compat`, `auto_delete`, or `current_input_file`.
+> Note: `merge` mode does not update `auto_delete` or `current_input_file`.
 ### `GET /admin/config/export`
@@ -917,12 +938,15 @@ Updates proxy binding for a specific account.
  "message": "API test successful (session creation only)",
  "model": "deepseek-v4-flash",
  "session_count": 0,
-  "config_writable": true
+  "config_writable": true,
+  "config_warning": ""
 }
 ```
 If a `message` is provided, `thinking` may also be included when the upstream response carries reasoning text.
 When the configured file path is not writable (for example, read-only `/app/config.json` inside some containers), login/session testing still proceeds; `config_warning` is returned to indicate token persistence failed and the token is memory-only until restart.
 ### `POST /admin/accounts/test-all`
 Optional request field: `model`.
@@ -1206,7 +1230,7 @@ Clients should handle HTTP status code plus `error` / `detail` fields.
 | Code | Meaning |
 | --- | --- |
 | `401` | Authentication failed (invalid key/token, or expired admin JWT) |
-| `429` | Too many requests (exceeded inflight + queue capacity) |
+| `429` | Too many requests (exceeded inflight + queue capacity; current responses do not include `Retry-After`) |
 | `503` | Model unavailable or upstream error |
 ---
--- a/API.md
+++ b/API.md
@@ -33,12 +33,16 @@
 | 健康检查 | `GET /healthz`、`GET /readyz` |
 | CORS | 已启用（统一覆盖 `/v1/*`、`/anthropic/*`、`/v1beta/models/*`、`/admin/*`；浏览器有 `Origin` 时回显该 Origin，否则为 `*`；默认允许 `Content-Type`, `Authorization`, `X-API-Key`, `X-Ds2-Target-Account`, `X-Ds2-Source`, `X-Vercel-Protection-Bypass`, `X-Goog-Api-Key`, `Anthropic-Version`, `Anthropic-Beta`，并会放行预检里声明的第三方请求头，如 `x-stainless-*`；Vercel 上 `/v1/chat/completions` 的 Node Runtime 也对齐相同行为；内部专用头 `X-Ds2-Internal-Token` 仍被拦截） |
 - 所有 JSON 请求体都必须是合法 UTF-8；非法字节序列会在入站阶段被拒绝为 `400 invalid json`。
 ### 3.0 接口适配层说明
 - OpenAI / Claude / Gemini 三套协议已统一挂在同一 `chi` 路由树上，由 `internal/server/router.go` 负责装配。
 - 适配器层职责收敛为：**请求归一化 → DeepSeek 调用 → 协议形态渲染**，减少历史版本中“同能力多处实现”的分叉。
 - Tool Calling 的解析策略在 Go 与 Node Runtime 间保持一致：推荐模型输出 DSML 外壳 `<|DSML|tool_calls>` → `<|DSML|invoke name="...">` → `<|DSML|parameter name="...">`；兼容层也接受 DSML wrapper 别名 `<dsml|tool_calls>`、`<|tool_calls>`、`<｜tool_calls>`、常见 DSML 分隔符漏写形态（如 `<|DSML tool_calls>`）、`DSML` 与工具标签名黏连的常见 typo（如 `<DSMLtool_calls>`），以及旧式 canonical XML `<tool_calls>` → `<invoke name="...">` → `<parameter name="...">`。实现上采用窄容错结构扫描：只有 `tool_calls` wrapper 或可修复的缺失 opening wrapper 会进入工具路径，裸 `<invoke>` 不计为已支持语法；流式场景继续执行防泄漏筛分。若参数体本身是合法 JSON 字面量（如 `123`、`true`、`null`、数组或对象），会按结构化值输出，不再一律当作字符串；若 CDATA 偶发漏闭合，则会在最终 parse / flush 恢复阶段做窄修复，尽量保住已完整包裹的外层工具调用。
 - `Admin API` 将配置与运行时策略分开：`/admin/config*` 管静态配置，`/admin/settings*` 管运行时行为。
 - 当上游返回 thinking-only 响应（模型输出了推理链但无可见文本）时，非流式补全会自动重试一次：以多轮对话 follow-up 方式追加 prompt 后缀 `"Previous reply had no visible output. Please regenerate the visible final answer or tool call now."` 并设置 `parent_message_id` 在同一 DeepSeek session 内让模型重新输出；重试最大 1 次。
 - 引用标记处理边界：流式输出默认隐藏 `[citation:N]` / `[reference:N]` 这类上游内部占位符；非流式输出默认把 DeepSeek 搜索引用标记转换为 Markdown 引用链接。
 ---
@@ -81,7 +85,7 @@ Vercel 一键部署可先只填 `DS2API_ADMIN_KEY`，部署后在 `/admin` 导
 - token 在 `config.keys` 中 → **托管账号模式**，自动轮询选择账号
 - token 不在 `config.keys` 中 → **直通 token 模式**，直接作为 DeepSeek token 使用
-**可选请求头**：`X-Ds2-Target-Account: <email_or_mobile>` — 指定使用某个托管账号。
+**可选请求头**：`X-Ds2-Target-Account: <email_or_mobile>` — 指定使用某个托管账号；如果目标账号不存在，或管理账号队列已耗尽，相关业务请求会返回 `429`，当前不会附带 `Retry-After` 头。若账号存在但登录/刷新失败，则返回对应的 `401` 或上游错误。
 Gemini 兼容客户端还可以使用 `x-goog-api-key`、`?key=` 或 `?api_key=` 作为凭据来源。
 ### Admin 接口（`/admin/*`）
@@ -109,6 +113,7 @@ Gemini 兼容客户端还可以使用 `x-goog-api-key`、`?key=` 或 `?api_key=`
 | GET | `/v1/responses/{response_id}` | 业务 | 查询已生成 response（内存 TTL） |
 | POST | `/v1/embeddings` | 业务 | OpenAI Embeddings 接口 |
 | POST | `/v1/files` | 业务 | OpenAI Files 上传（multipart/form-data） |
 | GET | `/v1/files/{file_id}` | 业务 | 查询已上传文件状态 |
 | GET | `/anthropic/v1/models` | 无 | Claude 模型列表 |
 | POST | `/anthropic/v1/messages` | 业务 | Claude 消息接口 |
 | POST | `/anthropic/v1/messages/count_tokens` | 业务 | Claude token 计数 |
@@ -163,8 +168,12 @@ Gemini 兼容客户端还可以使用 `x-goog-api-key`、`?key=` 或 `?api_key=`
 | GET | `/admin/chat-history/{id}` | Admin | 查看单条服务器端对话记录 |
 | DELETE | `/admin/chat-history/{id}` | Admin | 删除单条服务器端对话记录 |
 | PUT | `/admin/chat-history/settings` | Admin | 更新对话记录保留条数 |
 服务器端记录本质上是 DeepSeek 上游响应归档：OpenAI Chat、OpenAI Responses、Claude Messages、Gemini GenerateContent 等直连 DeepSeek 的生成接口，在收到上游响应后会于各协议回译/裁剪前写入记录；列表按请求创建时间倒序展示，流式请求会在生成过程中持续刷新状态与详情。WebUI「API 测试」发出的请求也会进入该记录。
 | GET | `/admin/version` | Admin | 查询当前版本与最新 Release |
 OpenAI `/v1/*` 仍是规范路径。对于只配置 DS2API 根地址的客户端，同一套 OpenAI handler 也通过根路径快捷路由暴露：`/models`、`/models/{id}`、`/chat/completions`、`/responses`、`/responses/{response_id}`、`/embeddings`、`/files`、`/files/{file_id}`。
 ---
 ## 健康检查
@@ -204,9 +213,7 @@ Gemini 兼容客户端还可以使用 `x-goog-api-key`、`?key=` 或 `?api_key=`
    {"id": "deepseek-v4-pro-search", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
    {"id": "deepseek-v4-pro-search-nothinking", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
    {"id": "deepseek-v4-vision", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
-    {"id": "deepseek-v4-vision-nothinking", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
+    {"id": "deepseek-v4-vision-nothinking", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []}
-    {"id": "deepseek-v4-vision-search", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []},
-    {"id": "deepseek-v4-vision-search-nothinking", "object": "model", "created": 1677610602, "owned_by": "deepseek", "permission": []}
  ]
 }
 ```
@@ -232,6 +239,7 @@ Gemini 兼容客户端还可以使用 `x-goog-api-key`、`?key=` 或 `?api_key=`
 - 其他兼容族：`llama-*`、`qwen-*`、`mistral-*`、`command-*` 会按家族启发式回退
 上述 alias 若在请求名后追加 `-nothinking` 后缀，也会映射到对应的强制关闭 thinking 版本。
 当前视觉能力仅对应 `deepseek-v4-vision` / `deepseek-v4-vision-nothinking`，不会解析出独立的 `vision-search` 变体。
 退役历史模型（如 `claude-1.*`、`claude-2.*`、`claude-instant-*`、`gpt-3.5*`）会被显式拒绝。
@@ -306,7 +314,7 @@ data: [DONE]
 - 开启 thinking 时会输出 `delta.reasoning_content`
 - 普通文本输出 `delta.content`
 - 最后一段包含 `finish_reason` 和 `usage`
- token 计数优先透传上游 DeepSeek SSE（如 `accumulated_token_usage` / `token_usage`）；仅在上游缺失时回退本地估算
+- token 计数优先透传上游 DeepSeek SSE（如 `accumulated_token_usage` / `token_usage`）；仅在上游缺失时回退本地估算。失败/中断型结束（例如 `response.failed`）可能不会携带 `usage`
 #### Tool Calls
@@ -423,7 +431,7 @@ data: [DONE]
 | `model` | string | ✅ | 支持原生模型 + alias 自动映射 |
 | `input` | string/array | ✅ | 支持字符串、字符串数组、token 数组 |
-> 需配置 `embeddings.provider`。当前支持：`mock` / `deterministic` / `builtin`。未配置或不支持时返回标准错误结构（HTTP 501）。
+> 需配置 `embeddings.provider`。当前支持：`mock` / `deterministic` / `builtin`（三者都走同一套本地确定性实现）。未配置或不支持时返回标准错误结构（HTTP 501）。
 ### `POST /v1/files`
@@ -437,9 +445,13 @@ data: [DONE]
 约束与行为：
 - 请求必须为 `multipart/form-data`，否则返回 `400`。
- 请求体总大小上限 `100 MiB`（超限返回 `413`）。
+- 请求体总大小上限 **100 MiB**（超限返回 `413`）。
 - 成功返回 OpenAI `file` 对象（`id/object/bytes/filename/purpose/status` 等字段），并附带 `account_id` 便于定位来源账号。
 ### `GET /v1/files/{file_id}`
 需要业务鉴权。查询 DeepSeek 上传文件的当前状态，并返回 OpenAI `file` 对象；未找到匹配文件时返回 `404`。
 ---
 ## Claude 兼容接口
@@ -494,6 +506,13 @@ anthropic-version: 2023-06-01
 | `stream` | boolean | ❌ | 默认 `false` |
 | `system` | string | ❌ | 可选系统提示 |
 | `tools` | array | ❌ | Claude tool 定义 |
 | `thinking` | object | ❌ | Anthropic thinking 配置；会转译为下游 reasoning 控制，`-nothinking` 模型会忽略 |
 | `temperature` | number | ❌ | 透传到下游；若同时提供 `top_p`，以 `temperature` 为准 |
 | `top_p` | number | ❌ | 当未提供 `temperature` 时透传到下游 |
 | `stop_sequences` | array | ❌ | 透传到下游停用序列 |
 | `tool_choice` | string/object | ❌ | 支持 `auto` / `none` / `required` / `{"type":"function","name":"..."}`，并会转译为下游工具选择 |
 > 说明：上述 `thinking`、`temperature`、`top_p`、`stop_sequences`、`tool_choice` 都会走兼容层转译；最终是否生效仍取决于当前模型和上游能力。`temperature` 与 `top_p` 同时存在时，`temperature` 优先。
 #### 非流式响应
@@ -546,7 +565,7 @@ data: {"type":"message_stop"}
 **说明**：
- 默认模型会按各 surface 的既有规则输出 thinking / reasoning 相关增量
+- 默认支持 thinking 的模型会输出 `thinking` block / `thinking_delta`；请求显式关闭 thinking 或使用 `-nothinking` 模型时不会输出
 - 带 `-nothinking` 后缀的模型会强制关闭 thinking，即使请求显式传了 `thinking` / `reasoning` / `reasoning_effort` 也不会输出 `thinking_delta`
 - 不会输出 `signature_delta`（上游 DeepSeek 未提供可验证签名）
 - `tools` 场景优先避免泄露原始工具 JSON，不强制发送 `input_json_delta`
@@ -593,6 +612,7 @@ data: {"type":"message_stop"}
 响应为 Gemini 兼容结构，核心字段包括：
 - `candidates[].content.parts[].text`
 - `candidates[].content.parts[].thought=true`（thinking 输出）
 - `candidates[].content.parts[].functionCall`（工具调用时）
 - `usageMetadata`（`promptTokenCount` / `candidatesTokenCount` / `totalTokenCount`）
@@ -601,6 +621,7 @@ data: {"type":"message_stop"}
 返回 SSE（`text/event-stream`），每个 chunk 为一条 `data: <json>`：
 - 常规文本：持续返回增量文本 chunk
 - thinking：持续返回 `parts[].thought=true` 的增量 chunk
 - `tools` 场景：会缓冲并在结束时输出 `functionCall` 结构
 - 结束 chunk：包含 `finishReason: "STOP"` 与 `usageMetadata`
 - token 计数优先透传上游 DeepSeek SSE（如 `accumulated_token_usage` / `token_usage`）；仅在上游缺失时回退本地估算
@@ -723,7 +744,6 @@ data: {"type":"message_stop"}
 - `success`
 - `admin`（`has_password_hash`、`jwt_expire_hours`、`jwt_valid_after_unix`、`default_password_warning`）
 - `runtime`（`account_max_inflight`、`account_max_queue`、`global_max_inflight`、`token_refresh_interval_hours`）
 - `compat`（`wide_input_strict_output`、`strip_reference_markers`）
 - `responses` / `embeddings`
 - `auto_delete`（`mode`：`none` / `single` / `all`；旧配置 `sessions=true` 仍按 `all` 处理）
 - `current_input_file`（`enabled` 默认返回 `true`、`min_chars`）
@@ -737,13 +757,11 @@ data: {"type":"message_stop"}
 - `admin.jwt_expire_hours`
 - `runtime.account_max_inflight` / `runtime.account_max_queue` / `runtime.global_max_inflight` / `runtime.token_refresh_interval_hours`
 - `compat.wide_input_strict_output` / `compat.strip_reference_markers`
 - `responses.store_ttl_seconds`
 - `embeddings.provider`
 - `auto_delete.mode`
 - `current_input_file.enabled` / `current_input_file.min_chars`
 - `model_aliases`
 - `history_split` 仅作为旧配置兼容字段保留，不再影响请求处理
 - `toolcall` 策略已固定，不再作为可写入字段
 ### `POST /admin/settings/password`
@@ -767,9 +785,9 @@ data: {"type":"message_stop"}
 请求可直接传配置对象，或使用 `{"config": {...}, "mode":"merge"}` 包裹格式。
 也支持在查询参数里传 `?mode=merge` / `?mode=replace`。
-`replace` 模式会按完整配置结构替换（保留 Vercel 同步元信息）；`merge` 模式会合并 `keys`、`api_keys`、`accounts`、`model_aliases`，并覆盖 `admin`、`runtime`、`responses`、`embeddings` 中的非空字段。`compat`、`auto_delete`、`current_input_file` 建议通过 `/admin/settings` 或配置文件管理；`history_split` 仅保留为旧配置兼容字段；`toolcall` 相关字段会被忽略。
+`replace` 模式会按完整配置结构替换（保留 Vercel 同步元信息）；`merge` 模式会合并 `keys`、`api_keys`、`accounts`、`model_aliases`，并覆盖 `admin`、`runtime`、`responses`、`embeddings` 中的非空字段。`auto_delete`、`current_input_file` 建议通过 `/admin/settings` 或配置文件管理；`compat` 与 `toolcall` 相关字段会被忽略。
-> 注意：`merge` 模式不会更新 `compat`、`auto_delete`、`current_input_file`。
+> 注意：`merge` 模式不会更新 `auto_delete`、`current_input_file`。
 ### `GET /admin/config/export`
@@ -934,12 +952,15 @@ data: {"type":"message_stop"}
  "message": "API 测试成功（仅会话创建）",
  "model": "deepseek-v4-flash",
  "session_count": 0,
-  "config_writable": true
+  "config_writable": true,
+  "config_warning": ""
 }
 ```
 如果传入 `message`，还会附带 `thinking`（当上游返回思考内容时）。
 当部署环境配置文件路径不可写（例如容器内默认 `/app/config.json` 只读）时，登录与会话测试仍可继续；此时会返回 `config_warning` 提示 token 仅保存在内存、重启后丢失。
 ### `POST /admin/accounts/test-all`
 可选请求字段：`model`
@@ -1222,7 +1243,7 @@ Gemini 路由使用 Google 风格错误结构：
 | 状态码 | 说明 |
 | --- | --- |
 | `401` | 鉴权失败（key/token 无效，或 Admin JWT 过期） |
-| `429` | 请求过多（超出并发上限 + 等待队列） |
+| `429` | 请求过多（超出并发上限 + 等待队列；当前不附带 `Retry-After` 头） |
 | `503` | 模型不可用或上游服务异常 |
 ---
--- a/2
+++ b/2
@@ -29,7 +29,7 @@ WORKDIR /app
 RUN apt-get update \
    && apt-get install -y --no-install-recommends ca-certificates \
    && groupadd -r ds2api && useradd -r -g ds2api -d /app -s /sbin/nologin ds2api \
-    && mkdir -p /app/data && chown -R ds2api:ds2api /app \
+    && mkdir -p /app/data /data && chown -R ds2api:ds2api /app /data \
    && rm -rf /var/lib/apt/lists/*
 COPY --from=busybox-tools /bin/busybox /usr/local/bin/busybox
 EXPOSE 5001
--- a/README.MD
+++ b/README.MD
@@ -17,12 +17,22 @@
 语言 / Language: [中文](README.MD) | [English](README.en.md)
-将 DeepSeek Web 对话能力转换为 OpenAI、Claude 与 Gemini 兼容 API。后端为 **Go 全量实现**，前端为 React WebUI 管理台（源码在 `webui/`，部署时自动构建到 `static/admin`）。
+将 DeepSeek Web 对话能力转换为 OpenAI、Claude 与 Gemini 兼容 API。核心后端以 **Go** 实现，Vercel 流式桥接额外使用少量 Node Runtime，前端为 React WebUI 管理台（源码在 `webui/`，部署时自动构建到 `static/admin`）。
 文档入口：[文档导航](docs/README.md) / [架构说明](docs/ARCHITECTURE.md) / [接口文档](API.md)
 【感谢Linux.do社区及GitHub社区各位开发者对项目的支持与贡献】
 ## Star History
 <a href="https://www.star-history.com/?repos=cjackhwang%2Fds2api&type=date&legend=top-left">
 <picture>
   <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/chart?repos=cjackhwang/ds2api&type=date&theme=dark&legend=top-left" />
   <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/chart?repos=cjackhwang/ds2api&type=date&legend=top-left" />
   <img alt="Star History Chart" src="https://api.star-history.com/chart?repos=cjackhwang/ds2api&type=date&legend=top-left" />
 </picture>
 </a>
 > **重要免责声明**
 >
 > 本仓库仅供学习、研究、个人实验和内部验证使用，不提供任何形式的商业授权、适用性保证或结果保证。
@@ -76,13 +86,14 @@ flowchart LR
        subgraph Runtime["运行时核心能力"]
            Compat["PromptCompat\n(API -> 网页纯文本上下文)"]
-            Chat["Chat / Responses Runtime\n(统一工具调用与流式语义)"]
+            Completion["Completion Runtime\n(Session / PoW / Completion)"]
            Turn["AssistantTurn\n(输出语义归一)"]
            Auth["Auth Resolver\n(API key / bearer / x-goog-api-key)"]
            Pool["Account Pool + Queue\n(并发槽位 + 等待队列)"]
            DSClient["DeepSeek Client\n(Session / Auth / Completion / Files)"]
            Pow["PoW 实现\n(纯 Go)"]
            Tool["Tool Sieve\n(Go/Node 语义对齐)"]
-            History["History Split\n(长历史文件化)"]
+            History["Current Input File\n(DS2API_HISTORY.txt)"]
        end
    end
@@ -94,18 +105,19 @@ flowchart LR
    OA --> Compat
    CA & GA --> Compat
-    Compat --> Chat
+    Compat --> Completion
-    Compat -.长历史.-> History
+    Completion -.完整上下文.-> History
-    Vercel -.Go prepare.-> Chat
+    Completion --> Turn
    Vercel -.Go prepare.-> Completion
    Vercel -.Node SSE.-> Tool
-    Chat --> Auth
+    Completion --> Auth
-    Chat -.账号轮询.-> Pool
+    Completion -.账号轮询.-> Pool
-    Chat -.工具调用解析.-> Tool
+    Completion -.工具调用解析.-> Tool
-    Chat -.PoW 计算.-> Pow
+    Completion -.PoW 计算.-> Pow
    Auth --> DSClient
    DSClient --> Upstream
    Upstream --> DSClient
-    Chat --> Client
+    Turn --> Client
    Vercel --> Client
 ```
@@ -119,7 +131,7 @@ flowchart LR
 | 能力 | 说明 |
 | --- | --- |
-| OpenAI 兼容 | `GET /v1/models`、`GET /v1/models/{id}`、`POST /v1/chat/completions`、`POST /v1/responses`、`GET /v1/responses/{response_id}`、`POST /v1/embeddings`、`POST /v1/files` |
+| OpenAI 兼容 | `GET /v1/models`、`GET /v1/models/{id}`、`POST /v1/chat/completions`、`POST /v1/responses`、`GET /v1/responses/{response_id}`、`POST /v1/embeddings`、`POST /v1/files`、`GET /v1/files/{file_id}` |
 | Claude 兼容 | `GET /anthropic/v1/models`、`POST /anthropic/v1/messages`、`POST /anthropic/v1/messages/count_tokens`（及快捷路径 `/v1/messages`、`/messages`） |
 | Gemini 兼容 | `POST /v1beta/models/{model}:generateContent`、`POST /v1beta/models/{model}:streamGenerateContent`（及 `/v1/models/{model}:*` 路径） |
 | 统一 CORS 兼容 | `/v1/*`、`/anthropic/*`、`/v1beta/models/*`、`/admin/*` 统一走同一套 CORS 策略；Vercel 上 `/v1/chat/completions` 的 Node Runtime 也对齐相同放行规则，尽量减少第三方预检请求头限制 |
@@ -131,6 +143,8 @@ flowchart LR
 | WebUI 管理台 | `/admin` 单页应用（中英文双语、深色模式，支持查看服务器端对话记录） |
 | 运维探针 | `GET /healthz`（存活）、`GET /readyz`（就绪） |
 OpenAI `/v1/*` 仍是推荐的规范路径；同时支持 `/models`、`/chat/completions`、`/responses`、`/embeddings`、`/files`、`/files/{file_id}` 等根路径快捷路由，方便只配置 DS2API 根地址的第三方客户端。
 ## 平台兼容矩阵
 | 级别 | 平台 | 当前状态 |
@@ -158,10 +172,9 @@ flowchart LR
 | expert | `deepseek-v4-pro-search-nothinking` | 永久关闭，不受请求参数影响 | ✅ |
 | vision | `deepseek-v4-vision` | 默认开启，可由请求参数控制 | ❌ |
 | vision | `deepseek-v4-vision-nothinking` | 永久关闭，不受请求参数影响 | ❌ |
 | vision | `deepseek-v4-vision-search` | 默认开启，可由请求参数控制 | ✅ |
 | vision | `deepseek-v4-vision-search-nothinking` | 永久关闭，不受请求参数影响 | ✅ |
 除原生模型外，也支持常见 alias 输入（如 `gpt-4.1`、`gpt-5`、`gpt-5-codex`、`o3`、`claude-*`、`gemini-*` 等），但 `/v1/models` 返回的是规范化后的 DeepSeek 原生模型 ID。若 alias 名本身追加 `-nothinking` 后缀，也会映射到对应的强制关思考模型。完整 alias 行为以 [API.md](API.md#模型-alias-解析策略) 和 `config.example.json` 为准。
 当前上游视觉模型只暴露 `vision` 通道，不提供独立的联网搜索视觉变体。
 ### Claude 接口（`GET /anthropic/v1/models`）
@@ -245,6 +258,8 @@ docker-compose logs -f
 ```
 默认 `docker-compose.yml` 会把宿主机 `6011` 映射到容器内的 `5001`。如果你希望直接对外暴露 `5001`，请设置 `DS2API_HOST_PORT=5001`（或者手动调整 `ports` 配置）。
 同时默认把 `./config.json` 挂载到容器 `/data/config.json`，并设置 `DS2API_CONFIG_PATH=/data/config.json`，用于避免 `/app` 只读导致运行时 token 持久化失败。
 镜像会预创建 `/data` 并授权给非 root 的 `ds2api` 用户；如果使用单文件 bind mount，请确保宿主机 `config.json` 对容器用户可读写，例如 `chmod 644 config.json`。
 更新镜像：`docker-compose up -d --build`
@@ -254,6 +269,10 @@ docker-compose logs -f
 2. 部署完成后访问 `/admin`，使用 Zeabur 环境变量/模板指引中的 `DS2API_ADMIN_KEY` 登录。
 3. 在管理台导入/编辑配置（会写入并持久化到 `/data/config.json`）。
 Zeabur 首次空卷启动时可以没有 `/data/config.json`；DS2API 会先使用空的文件模式配置启动，并在管理台首次保存时创建该文件。
 不依赖模板手动部署时，在 Zeabur 中选择 GitHub 仓库服务，Root Directory 保持 `/`，使用仓库根目录 `Dockerfile` 构建；添加持久卷 `/data`，设置 `PORT=5001`、`DS2API_ADMIN_KEY=你的强密钥`、`DS2API_CONFIG_PATH=/data/config.json`，然后暴露 HTTP 端口 `5001`。更完整步骤见 [docs/DEPLOY.md](docs/DEPLOY.md#不使用模板手动部署)。
 说明：Zeabur 使用仓库内 `Dockerfile` 直接构建时，不需要额外传入 `BUILD_VERSION`；镜像会优先读取该构建参数，未提供时自动回退到仓库根目录的 `VERSION` 文件。
 ### 方式三：Vercel 部署
@@ -282,7 +301,7 @@ base64 < config.json | tr -d '\n'
 ### 方式四：本地源码运行
-**前置要求**：Go 1.26+，Node.js `20.19+` 或 `22.12+`（仅在需要构建 WebUI 时）
+**前置要求**：Go 1.26+，Node.js `20.19+` 或 `22.12+`（仅在需要构建 WebUI 时）；同时确保 `npm` 可用，建议 `npm 10+`
 ```bash
 # 1. 克隆仓库
@@ -301,7 +320,7 @@ go run ./cmd/ds2api
 服务实际绑定：`0.0.0.0:5001`，因此同一局域网设备通常也可以通过你的内网 IP 访问。
-> **WebUI 自动构建**：本地首次启动时，若 `static/admin` 不存在，会自动尝试执行 `npm ci`（仅在缺少依赖时）和 `npm run build -- --outDir static/admin --emptyOutDir`（需要本机有 Node.js）。你也可以手动构建：`./scripts/build-webui.sh`
+> **WebUI 自动构建**：本地首次启动时，若 `static/admin` 不存在，会自动尝试执行 `npm ci`（仅在缺少依赖时）和 `npm run build -- --outDir static/admin --emptyOutDir`（需要本机有 Node.js 和 npm）。你也可以手动构建：`./scripts/build-webui.sh`
 ## 配置说明
@@ -314,8 +333,7 @@ go run ./cmd/ds2api
 - `model_aliases`：OpenAI / Claude / Gemini 共用的模型 alias 映射。
 - `runtime`：账号并发、队列与 token 刷新策略，可通过 Admin Settings 热更新。
 - `auto_delete.mode`：请求结束后的远端会话清理策略，支持 `none` / `single` / `all`。
- `history_split`：旧轮次拆分字段，已废弃并忽略，仅保留兼容旧配置。
+- `current_input_file`：全局生效的上下文拆分上传策略；默认开启且阈值为 `0`，触发时将完整上下文合并上传为 `DS2API_HISTORY.txt` 上下文文件。
 - `current_input_file`：唯一生效的独立拆分策略；默认开启且阈值为 `0`，触发时将完整上下文合并上传为隐藏上下文文件。
 - 如果关闭 `current_input_file`，请求会直接透传，不上传拆分上下文文件。
 - `thinking_injection`：默认开启；在最新 user 消息末尾追加思考增强提示词，提高高强度推理与工具调用前的思考稳定性；`prompt` 留空时使用内置默认提示词。
@@ -331,6 +349,7 @@ go run ./cmd/ds2api
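 | **Pass-through token mode** | When the supplied token is not in `config.keys`, it is used directly as a DeepSeek token |
 Optional request header `X-Ds2-Target-Account`: selects a specific managed account (the value is an email or mobile number).
 If the specified account does not exist, or the current managed-account queue is full, the request returns `429`; currently `429` does not carry a `Retry-After` header. If the account exists but login/refresh fails, the corresponding auth error is returned.
 Gemini routes can also use `x-goog-api-key`, or, when no auth header is present, `?key=` / `?api_key=` as the caller credential.
 ## Concurrency Model
@@ -343,7 +362,7 @@ Gemini routes can also use `x-goog-api-key`, or when no auth header is present use `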
 ```
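 - When the in-flight slots are full, requests enter the wait queue and are **not rejected with 429 immediately**
-- `429 Too Many Requests` is returned only after the total capacity limit is exceeded
+- `429 Too Many Requests` is returned only after the total capacity limit is exceeded; the current response does not carry `Retry-After`
 - `GET /admin/queue/status` returns the real-time concurrency status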
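The admission behavior in the list above can be sketched roughly as follows. This is a minimal illustration, not DS2API's implementation: the `Gate` type, its fields, and the `Admit` / `Release` / `StatusFor` names are all hypothetical, and real waiters would block rather than receive a `Wait` value.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

// Decision is what the hypothetical gate tells a caller to do.
type Decision int

const (
	Run    Decision = iota // a slot was free; start immediately
	Wait                   // slots are full; the caller joins the wait queue
	Reject                 // slots and queue are both full; answer 429
)

// Gate is an illustrative admission gate, not DS2API's real type.
type Gate struct {
	mu          sync.Mutex
	inflight    int
	waiting     int
	maxInflight int
	maxQueue    int
}

// Admit classifies a new request against the current load.
func (g *Gate) Admit() Decision {
	g.mu.Lock()
	defer g.mu.Unlock()
	switch {
	case g.inflight < g.maxInflight:
		g.inflight++
		return Run
	case g.waiting < g.maxQueue:
		g.waiting++
		return Wait
	default:
		return Reject
	}
}

// Release ends one in-flight request and hands its slot to a waiter, if any.
func (g *Gate) Release() {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.inflight == 0 {
		return
	}
	if g.waiting > 0 {
		g.waiting-- // the freed slot goes straight to a queued request
		return
	}
	g.inflight--
}

// StatusFor maps a decision to the HTTP status a client would see.
func StatusFor(d Decision) int {
	if d == Reject {
		return http.StatusTooManyRequests // sent without a Retry-After header
	}
	return http.StatusOK
}

func main() {
	g := &Gate{maxInflight: 1, maxQueue: 1}
	fmt.Println(g.Admit(), g.Admit(), g.Admit()) // 0 1 2
}
```

Because the `429` carries no `Retry-After`, a client has to pick its own backoff interval before retrying.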
 ## Tool Call Adaptation
@@ -421,10 +440,10 @@ npm run build --prefix webui
 Workflow file: `.github/workflows/release-artifacts.yml`
-- **Trigger**: only on GitHub Release `published` (normal pushes do not trigger it)
+- **Trigger**: by default, automatically only on GitHub Release `published`; it can also be run manually via `workflow_dispatch` on the Actions page, passing `release_tag` to rerun / backfill
-- **Outputs**: multi-platform binary archives (`linux/amd64`, `linux/arm64`, `linux/armv7`, `darwin/amd64`, `darwin/arm64`, `windows/amd64`, `windows/arm64`) + `sha256sums.txt`
+- **Outputs**: multi-platform binary archives (`linux/amd64`, `linux/arm64`, `linux/armv7`, `darwin/amd64`, `darwin/arm64`, `windows/amd64`, `windows/arm64`), Linux Docker image export archives + `sha256sums.txt`
 - **Container image publishing**: pushed to GHCR only (`ghcr.io/cjackhwang/ds2api`)
-- **Each archive includes**: the `ds2api` executable, `static/admin`, the WASM file (with built-in fallback support), the `config.example.json` config example, README, LICENSE
+- **Each binary archive includes**: the `ds2api` executable, `static/admin`, `config.example.json`, `.env.example`, `README.MD`, `README.en.md`, `LICENSE`
 ## Disclaimer
--- a/README.en.md
+++ b/README.en.md
@@ -16,10 +16,20 @@
 Language: [中文](README.MD) | [English](README.en.md)
-DS2API converts DeepSeek Web chat capability into OpenAI-compatible, Claude-compatible, and Gemini-compatible APIs. The backend is a **pure Go implementation**, with a React WebUI admin panel (source in `webui/`, build output auto-generated to `static/admin` during deployment).
+DS2API converts DeepSeek Web chat capability into OpenAI-compatible, Claude-compatible, and Gemini-compatible APIs. The core backend is Go-based, with a small Node Runtime bridge used for Vercel streaming, and the React WebUI admin panel lives in `webui/` (build output auto-generated to `static/admin` during deployment).
 Documentation entry: [Docs Index](docs/README.md) / [Architecture](docs/ARCHITECTURE.en.md) / [API Reference](API.en.md)
 ## Star History
 <a href="https://www.star-history.com/?repos=cjackhwang%2Fds2api&type=date&legend=top-left">
 <picture>
   <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/chart?repos=cjackhwang/ds2api&type=date&theme=dark&legend=top-left" />
   <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/chart?repos=cjackhwang/ds2api&type=date&legend=top-left" />
   <img alt="Star History Chart" src="https://api.star-history.com/chart?repos=cjackhwang/ds2api&type=date&legend=top-left" />
 </picture>
 </a>
 > **Important Disclaimer**
 >
 > This repository is provided for learning, research, personal experimentation, and internal validation only. It does not grant any commercial authorization and comes with no warranty of fitness, stability, or results.
@@ -73,13 +83,14 @@ flowchart LR
        subgraph Runtime["Runtime + Core Capabilities"]
            Compat["PromptCompat\n(API -> web-chat plain text context)"]
-            Chat["Chat / Responses Runtime\n(unified tools + stream semantics)"]
+            Completion["Completion Runtime\n(session / PoW / completion)"]
            Turn["AssistantTurn\n(output semantic normalization)"]
            Auth["Auth Resolver\n(API key / bearer / x-goog-api-key)"]
            Pool["Account Pool + Queue\n(in-flight slots + wait queue)"]
            DSClient["DeepSeek Client\n(session / auth / completion / files)"]
            Pow["PoW Solver\n(Pure Go)"]
            Tool["Tool Sieve\n(Go/Node semantic parity)"]
-            History["History Split\n(long history as files)"]
+            History["Current Input File\n(DS2API_HISTORY.txt)"]
        end
    end
@@ -91,18 +102,19 @@ flowchart LR
    OA --> Compat
    CA & GA --> Compat
-    Compat --> Chat
+    Compat --> Completion
-    Compat -.long history.-> History
+    Completion -.full context.-> History
-    Vercel -.Go prepare.-> Chat
+    Completion --> Turn
    Vercel -.Go prepare.-> Completion
    Vercel -.Node SSE.-> Tool
-    Chat --> Auth
+    Completion --> Auth
-    Chat -.account rotation.-> Pool
+    Completion -.account rotation.-> Pool
-    Chat -.tool-call parsing.-> Tool
+    Completion -.tool-call parsing.-> Tool
-    Chat -.PoW solving.-> Pow
+    Completion -.PoW solving.-> Pow
    Auth --> DSClient
    DSClient --> Upstream
    Upstream --> DSClient
-    Chat --> Client
+    Turn --> Client
    Vercel --> Client
 ```
@@ -116,7 +128,7 @@ For the full module-by-module architecture and directory responsibilities, see [
 | Capability | Details |
 | --- | --- |
-| OpenAI compatible | `GET /v1/models`, `GET /v1/models/{id}`, `POST /v1/chat/completions`, `POST /v1/responses`, `GET /v1/responses/{response_id}`, `POST /v1/embeddings`, `POST /v1/files` |
+| OpenAI compatible | `GET /v1/models`, `GET /v1/models/{id}`, `POST /v1/chat/completions`, `POST /v1/responses`, `GET /v1/responses/{response_id}`, `POST /v1/embeddings`, `POST /v1/files`, `GET /v1/files/{file_id}` |
 | Claude compatible | `GET /anthropic/v1/models`, `POST /anthropic/v1/messages`, `POST /anthropic/v1/messages/count_tokens` (plus shortcut paths `/v1/messages`, `/messages`) |
 | Gemini compatible | `POST /v1beta/models/{model}:generateContent`, `POST /v1beta/models/{model}:streamGenerateContent` (plus `/v1/models/{model}:*` paths) |
 | Unified CORS compatibility | `/v1/*`, `/anthropic/*`, `/v1beta/models/*`, and `/admin/*` share one CORS policy; on Vercel, the Node Runtime for `/v1/chat/completions` mirrors the same relaxed preflight behavior for third-party clients |
@@ -128,6 +140,8 @@ For the full module-by-module architecture and directory responsibilities, see [
 | WebUI Admin Panel | SPA at `/admin` (bilingual Chinese/English, dark mode, with server-side conversation history) |
 | Health Probes | `GET /healthz` (liveness), `GET /readyz` (readiness) |
 OpenAI `/v1/*` routes remain canonical, and DS2API also accepts root shortcuts such as `/models`, `/chat/completions`, `/responses`, `/embeddings`, `/files`, and `/files/{file_id}` for clients configured with the bare service URL.
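One way to picture the shortcut handling described above is a small path-normalization step in front of the canonical `/v1/*` routes. The sketch below is an assumption about how such a mapping could look, not the router's actual code; `rootShortcuts` and `canonicalPath` are hypothetical names, and only the shortcuts listed above are covered.

```go
package main

import (
	"fmt"
	"strings"
)

// rootShortcuts lists the bare paths named in the README paragraph.
var rootShortcuts = map[string]bool{
	"/models":           true,
	"/chat/completions": true,
	"/responses":        true,
	"/embeddings":       true,
	"/files":            true,
}

// canonicalPath rewrites a known root shortcut to its /v1 form and leaves
// every other path untouched. "/files/{file_id}" is matched by prefix.
func canonicalPath(p string) string {
	isFileID := strings.HasPrefix(p, "/files/") && len(p) > len("/files/")
	if rootShortcuts[p] || isFileID {
		return "/v1" + p
	}
	return p
}

func main() {
	for _, p := range []string{"/models", "/files/file-123", "/v1/responses", "/healthz"} {
		fmt.Println(p, "->", canonicalPath(p))
	}
}
```

Keeping the mapping as a pure function makes it easy to apply either as middleware or as duplicate route registrations.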
 ## Platform Compatibility Matrix
 | Tier | Platform | Status |
@@ -150,9 +164,9 @@ For the full module-by-module architecture and directory responsibilities, see [
 | default | `deepseek-v4-flash-search` | enabled by default, request-controlled | ✅ |
 | expert | `deepseek-v4-pro-search` | enabled by default, request-controlled | ✅ |
 | vision | `deepseek-v4-vision` | enabled by default, request-controlled | ❌ |
 | vision | `deepseek-v4-vision-search` | enabled by default, request-controlled | ✅ |
 Besides native IDs, DS2API also accepts common aliases as input (for example `gpt-4.1`, `gpt-5`, `gpt-5-codex`, `o3`, `claude-*`, `gemini-*`), but `/v1/models` returns normalized DeepSeek native model IDs. The complete alias behavior is documented in [API.en.md](API.en.md#model-alias-resolution) and `config.example.json`.
 Current upstream vision support exposes only the `vision` lane and does not provide a separate search-enabled vision variant.
 ### Claude Endpoint (`GET /anthropic/v1/models`)
@@ -233,6 +247,7 @@ docker-compose up -d
 ```
 The default `docker-compose.yml` uses `ghcr.io/cjackhwang/ds2api:latest` and maps host port `6011` to container port `5001`. If you want `5001` exposed directly, set `DS2API_HOST_PORT=5001` (or adjust the `ports` mapping).
 It also mounts `./config.json` to `/data/config.json` and sets `DS2API_CONFIG_PATH=/data/config.json` by default, which avoids runtime token persistence failures caused by read-only `/app`.
 Rebuild after updates: `docker-compose up -d --build`
@@ -242,6 +257,10 @@ Rebuild after updates: `docker-compose up -d --build`
 2. After deployment, open `/admin` and log in with the `DS2API_ADMIN_KEY` shown in the Zeabur env/template instructions.
 3. Import / edit config in Admin UI (it will be written and persisted to `/data/config.json`).
 Fresh Zeabur volumes can start without `/data/config.json`; DS2API will boot with an empty file-backed config and create the file on the first Admin UI save.
 For manual deployment without the template, create a Zeabur GitHub service, keep Root Directory as `/`, build with the repo-root `Dockerfile`, mount a persistent volume at `/data`, set `PORT=5001`, `DS2API_ADMIN_KEY=your-strong-secret`, and `DS2API_CONFIG_PATH=/data/config.json`, then expose HTTP port `5001`. See [docs/DEPLOY.en.md](docs/DEPLOY.en.md#manual-deployment-without-the-template) for the full guide.
 Note: when Zeabur builds directly from the repo `Dockerfile`, you do not need to pass `BUILD_VERSION`. The image prefers that build arg when provided, and automatically falls back to the repo-root `VERSION` file when it is absent.
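The fallback order in the note above (build arg first, then the `VERSION` file) can be sketched as a small resolver. This is a hedged illustration under stated assumptions: `resolveVersion` and its `readFile` callback are hypothetical, and the real image may implement the fallback in the Dockerfile or build scripts rather than in Go.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// resolveVersion is a hypothetical helper. It prefers the BUILD_VERSION build
// arg and only consults the VERSION file (via readFile) when the arg is empty.
func resolveVersion(buildArg string, readFile func() ([]byte, error)) (string, error) {
	if v := strings.TrimSpace(buildArg); v != "" {
		return v, nil
	}
	data, err := readFile()
	if err != nil {
		return "", fmt.Errorf("no BUILD_VERSION and no VERSION file: %w", err)
	}
	return strings.TrimSpace(string(data)), nil
}

func main() {
	v, err := resolveVersion(os.Getenv("BUILD_VERSION"), func() ([]byte, error) {
		return os.ReadFile("VERSION") // repo-root VERSION file
	})
	if err != nil {
		fmt.Println("version unknown:", err)
		return
	}
	fmt.Println("version:", v)
}
```

Passing the file read in as a callback keeps the precedence rule testable without touching the filesystem.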
 ### Option 3: Vercel
@@ -302,8 +321,7 @@ Common fields:
 - `model_aliases`: one shared alias map for OpenAI / Claude / Gemini model names.
 - `runtime`: account concurrency, queueing, and token refresh behavior, hot-reloadable via Admin Settings.
 - `auto_delete.mode`: remote session cleanup after each request, supporting `none` / `single` / `all`.
-- `history_split`: legacy multi-turn history split field, now ignored and kept only for backward-compatible config loading.
+- `current_input_file`: the global context split/upload mode; it is enabled by default and uploads the full context as a `DS2API_HISTORY.txt` context file once the character threshold is reached.
 - `current_input_file`: the only active split mode; it is enabled by default and uploads the full context as a hidden context file once the character threshold is reached.
 - If you turn off `current_input_file`, requests pass through directly without uploading any split context file.
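The `current_input_file` behavior in the list above amounts to a small decision per request. The sketch below is illustrative only: the `CurrentInputFile` struct, the `MinChars` field name, counting characters as runes, and treating "reached" as greater-or-equal are all assumptions; only the `DS2API_HISTORY.txt` file name and the on/off semantics come from the text.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// historyFileName is the context file name given in the README.
const historyFileName = "DS2API_HISTORY.txt"

// CurrentInputFile mirrors the config entry; field names are hypothetical.
type CurrentInputFile struct {
	Enabled  bool
	MinChars int // character threshold; 0 means every request qualifies
}

// UploadName returns the file name to upload the full context under, or ""
// when the request should pass through without a context file.
func (c CurrentInputFile) UploadName(fullContext string) string {
	if !c.Enabled {
		return ""
	}
	if utf8.RuneCountInString(fullContext) < c.MinChars {
		return ""
	}
	return historyFileName
}

func main() {
	def := CurrentInputFile{Enabled: true, MinChars: 0}
	fmt.Printf("%q\n", def.UploadName("hello")) // "DS2API_HISTORY.txt"

	off := CurrentInputFile{Enabled: false}
	fmt.Printf("%q\n", off.UploadName("hello")) // ""
}
```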
 For the full environment variable list, see [docs/DEPLOY.en.md](docs/DEPLOY.en.md). For auth behavior, see [API.en.md](API.en.md#authentication).
@@ -406,10 +424,10 @@ npm run build --prefix webui
 Workflow: `.github/workflows/release-artifacts.yml`
-- **Trigger**: only on GitHub Release `published` (normal pushes do not trigger builds)
+- **Trigger**: by default only on GitHub Release `published`; you can also run it manually via `workflow_dispatch` and pass `release_tag` to rerun / backfill
-- **Outputs**: multi-platform archives (`linux/amd64`, `linux/arm64`, `linux/armv7`, `darwin/amd64`, `darwin/arm64`, `windows/amd64`, `windows/arm64`) + `sha256sums.txt`
+- **Outputs**: multi-platform binary archives (`linux/amd64`, `linux/arm64`, `linux/armv7`, `darwin/amd64`, `darwin/arm64`, `windows/amd64`, `windows/arm64`), Linux Docker image export tarballs, and `sha256sums.txt`
 - **Container publishing**: GHCR only (`ghcr.io/cjackhwang/ds2api`)
-- **Each archive includes**: `ds2api` executable, `static/admin`, WASM file (with embedded fallback support), `config.example.json`-based config template, README, LICENSE
+- **Each binary archive includes**: the `ds2api` executable, `static/admin`, `config.example.json`, `.env.example`, `README.MD`, `README.en.md`, and `LICENSE`
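Since each release ships a `sha256sums.txt`, a downloaded archive can be checked before use. The sketch below assumes the file follows the usual `sha256sum` layout (`<hex digest>  <file name>` per line), which the text does not confirm; `findSum`, `verify`, and the archive name `example.tar.gz` are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// findSum returns the hex digest listed for name in sha256sum-style text.
func findSum(sums, name string) (string, bool) {
	for _, line := range strings.Split(sums, "\n") {
		fields := strings.Fields(line)
		// sha256sum marks binary mode with a leading '*' on the file name.
		if len(fields) == 2 && strings.TrimPrefix(fields[1], "*") == name {
			return fields[0], true
		}
	}
	return "", false
}

// verify reports whether data hashes to the digest listed for name.
func verify(sums, name string, data []byte) bool {
	want, ok := findSum(sums, name)
	if !ok {
		return false
	}
	got := sha256.Sum256(data)
	return strings.EqualFold(want, hex.EncodeToString(got[:]))
}

func main() {
	// The digest below is the SHA-256 of the three bytes "abc".
	sums := "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad  example.tar.gz\n"
	fmt.Println(verify(sums, "example.tar.gz", []byte("abc"))) // true
}
```

In practice the archive bytes would come from `os.ReadFile` on the downloaded file.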
 ## Disclaimer
--- a/2
+++ b/2
@@ -1 +1 @@
-4.1.3
+4.4.1
--- a/config.example.json
+++ b/config.example.json
@@ -43,10 +43,6 @@
    "gpt-5.3-codex": "deepseek-v4-pro",
    "o3": "deepseek-v4-pro"
  },
  "compat": {
    "wide_input_strict_output": true,
    "strip_reference_markers": true
  },
  "responses": {
    "store_ttl_seconds": 900
  },
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -9,8 +9,9 @@ services:
      # Host port is configurable via DS2API_HOST_PORT; container port stays fixed at 5001.
      - "${DS2API_HOST_PORT:-6011}:5001"
    volumes:
-      - ./config.json:/app/config.json    # config file
+      - ./config.json:/data/config.json   # config file (recommended path for persistence)
    environment:
      - TZ=Asia/Shanghai
      - LOG_LEVEL=INFO
      - DS2API_ADMIN_KEY=${DS2API_ADMIN_KEY:-ds2api}
      - DS2API_CONFIG_PATH=/data/config.json
--- a/docs/ARCHITECTURE.en.md
+++ b/docs/ARCHITECTURE.en.md
@@ -15,6 +15,7 @@ ds2api/
 │   └── workflows/                        # GitHub Actions workflows
 ├── api/                                  # Serverless entrypoints (Vercel Go/Node)
 ├── app/                                  # Application-level handler assembly
 ├── artifacts/                            # Debug artifacts (raw-stream-sim, stream-debug, etc.)
 ├── cmd/                                  # Executable entrypoints
 │   ├── ds2api/                           # Main service bootstrap
 │   └── ds2api-tests/                     # E2E testsuite CLI bootstrap
@@ -25,6 +26,8 @@ ds2api/
 │   ├── chathistory/                      # Server-side conversation history storage/query
 │   ├── claudeconv/                       # Claude message conversion helpers
 │   ├── compat/                           # Compatibility and regression helpers
 │   ├── assistantturn/                    # Upstream output to canonical assistant turn / stream event semantics
 │   ├── completionruntime/                # Shared Go DeepSeek completion startup, non-stream collection, and retry
 │   ├── config/                           # Config loading/validation/hot reload
 │   ├── deepseek/                         # DeepSeek upstream client/protocol/transport
 │   │   ├── client/                       # Login/session/completion/upload/delete calls
@@ -38,13 +41,14 @@ ds2api/
 │   │   ├── admin/                        # Admin API root assembly and resource packages
 │   │   ├── claude/                       # Claude HTTP protocol adapter
 │   │   ├── gemini/                       # Gemini HTTP protocol adapter
-│   │   └── openai/                       # OpenAI HTTP surface
+│   │   ├── openai/                       # OpenAI HTTP surface
-│   │       ├── chat/                     # Chat Completions execution entrypoint
+│   │   │   ├── chat/                     # Chat Completions execution entrypoint
-│   │       ├── responses/                # Responses API and response store
+│   │   │   ├── responses/                # Responses API and response store
-│   │       ├── files/                    # Files API and inline-file preprocessing
+│   │   │   ├── files/                    # Files API and inline-file preprocessing
-│   │       ├── embeddings/               # Embeddings API
+│   │   │   ├── embeddings/               # Embeddings API
-│   │       ├── history/                  # OpenAI context file handling
+│   │   │   ├── history/                  # OpenAI context file handling
-│   │       └── shared/                   # OpenAI HTTP errors/models/tool formatting
+│   │   │   └── shared/                   # OpenAI HTTP errors/models/tool formatting
 │   │   └── requestbody/                  # HTTP body reading and UTF-8/JSON validation helpers
 │   ├── js/                               # Node runtime related logic
 │   │   ├── chat-stream/                  # Node streaming bridge
 │   │   ├── helpers/                      # JS helper modules
@@ -61,13 +65,14 @@ ds2api/
 │   ├── textclean/                        # Text cleanup
 │   ├── toolcall/                         # Tool-call parsing and repair
 │   ├── toolstream/                       # Go streaming tool-call anti-leak and delta detection
-│   ├── translatorcliproxy/               # Cross-protocol translation bridge
+│   ├── translatorcliproxy/               # Vercel/fallback/test protocol translation bridge
 │   ├── util/                             # Shared utility helpers
 │   ├── version/                          # Version query/compare
 │   └── webui/                            # WebUI static hosting logic
 ├── plans/                                # Stage plans and manual QA records
 ├── pow/                                  # PoW standalone implementation + benchmarks
 ├── scripts/                              # Build/release helper scripts
 ├── static/                               # Build artifacts (admin static resources)
 ├── tests/                                # Test assets and scripts
 │   ├── compat/                           # Compatibility fixtures + expected outputs
 │   │   ├── expected/                     # Expected output samples
@@ -76,9 +81,9 @@ ds2api/
 │   │       └── toolcalls/                # Tool-call fixtures
 │   ├── node/                             # Node unit tests
 │   ├── raw_stream_samples/               # Upstream raw SSE samples
 │   │   ├── content-filter-trigger-20260405-jwt3/          # Content-filter terminal sample
 │   │   ├── continue-thinking-snapshot-replay-20260405/    # Continue-thinking sample
-│   │   ├── guangzhou-weather-reasoner-search-20260404/    # Search/reference sample
+│   │   ├── longtext-deepseek-v4-flash-20260429/           # Flash long-text/file-upload sample
 │   │   ├── longtext-deepseek-v4-pro-20260429/             # Pro long-text/file-upload sample
 │   │   ├── markdown-format-example-20260405/              # Markdown sample
 │   │   └── markdown-format-example-20260405-spacefix/     # Space-fix sample
 │   ├── scripts/                          # Test entry scripts
@@ -91,6 +96,8 @@ ds2api/
        ├── features/                     # Feature modules
        │   ├── account/                  # Account management page
        │   ├── apiTester/                # API tester page
        │   ├── chatHistory/              # Server-side conversation history page
        │   ├── proxy/                    # Proxy management page
        │   ├── settings/                 # Settings page
        │   └── vercel/                   # Vercel sync page
        ├── layout/                       # Layout components
@@ -124,8 +131,11 @@ flowchart LR
    subgraph RUNTIME[Shared runtime]
        AUTH[internal/auth]
        POOL[internal/account queue + concurrency]
        CR[internal/completionruntime]
        TURN[internal/assistantturn]
        STREAM[internal/stream + internal/sse]
        TOOL[internal/toolcall + internal/toolstream]
        FMT[internal/format/openai + claude]
        DS[internal/deepseek/client]
        POW[pow + internal/deepseek/protocol]
    end
@@ -151,16 +161,24 @@ flowchart LR
    PC --> PROMPT
    PC -.long history.-> HIST
    PC --> AUTH
    PC --> CR
    NCS -.Go prepare/release.-> CHAT
    NCS --> JS
    JS --> TOOL
    AUTH --> POOL
-    CHAT --> STREAM
+    CHAT --> CR
-    RESP --> STREAM
+    RESP --> CR
    CA --> CR
    GA --> CR
    CR --> DS
    CR --> STREAM
    CR --> TURN
    STREAM --> TURN
    STREAM --> TOOL
-    POOL --> DS
+    TURN --> FMT
    POOL --> CR
    DS --> POW
    DS --> U[DeepSeek upstream]
 ```
@@ -169,9 +187,12 @@ flowchart LR
 - `internal/server`: router tree + middlewares (health, protocol routes, Admin/WebUI).
 - `internal/httpapi/openai/*`: OpenAI HTTP surface split into chat, responses, files, embeddings, history, and shared packages; chat/responses share the promptcompat, stream, and toolcall semantics.
-- `internal/httpapi/{claude,gemini}`: protocol wrappers that normalize into the same prompt compatibility semantics without duplicating upstream execution.
+- `internal/httpapi/{claude,gemini}`: protocol adapters that normalize into the same prompt compatibility semantics; normal direct paths must share DeepSeek session/PoW/completion execution through `completionruntime`, while `translatorcliproxy` is reserved for Vercel prepare/release, missing-backend fallback, and regression tests.
 - `internal/httpapi/requestbody`: shared HTTP body reading, JSON pre-validation, and UTF-8 error helpers across protocol adapters.
 - `internal/promptcompat`: compatibility core for turning OpenAI/Claude/Gemini requests into DeepSeek web-chat plain-text context.
-- `internal/translatorcliproxy`: structure translation between Claude/Gemini and OpenAI.
+- `internal/assistantturn`: Go output-side canonical semantics, converting DeepSeek SSE collection results and stream finalization state into assistant turns and centralizing thinking, tool call, citation, usage, stop/error behavior.
 - `internal/completionruntime`: shared Go completion execution helpers for DeepSeek session/PoW/call startup, non-stream collection, and empty-output retry; streaming paths use it to start upstream requests, continue to use `internal/stream` for real-time consumption, and use `assistantturn` during finalization.
 - `internal/translatorcliproxy`: bridge compatibility layer for Claude/Gemini and OpenAI shape translation; it is not the main business protocol conversion center.
 - `internal/deepseek/{client,protocol,transport}`: upstream requests, sessions, PoW adaptation, protocol constants, and transport details.
 - `internal/js/chat-stream` + `api/chat-stream.js`: Vercel Node streaming bridge; Go prepare/release owns auth, account lease, and completion payload assembly, while Node relays real-time SSE with Go-aligned finalization and tool sieve semantics.
 - `internal/stream` + `internal/sse`: Go stream parsing and incremental assembly.
@@ -180,6 +201,13 @@ flowchart LR
 - `internal/chathistory`: server-side conversation history persistence, pagination, detail lookup, and retention policy.
 - `internal/config`: config loading/validation + runtime settings hot-reload.
 - `internal/account`: managed account pool, inflight slots, waiting queue.
 - `internal/textclean`: text cleanup helpers, e.g. stripping `[reference: N]` markers.
 - `internal/claudeconv`: Claude API request to DeepSeek format conversion.
 - `internal/compat`: compatibility regression tests using SSE fixtures to verify output consistency.
 - `internal/rawsample`: upstream raw response capture, read/write, and management.
 - `internal/devcapture`: developer debug capture, storing HTTP request/response for troubleshooting.
 - `internal/util`: cross-package utilities including JSON writing, type conversion, token counting, thinking parsing, etc.
 - `internal/version`: version query and comparison, supporting build-time injection and runtime resolution.
 ## 4. WebUI Runtime Relation
--- a/docs/ARCHITECTURE.md
+++ b/docs/ARCHITECTURE.md
@@ -15,6 +15,7 @@ ds2api/
 │   └── workflows/                        # GitHub Actions workflows
 ├── api/                                  # Serverless entrypoints (Vercel Go/Node)
 ├── app/                                  # Application-level handler assembly
 ├── artifacts/                            # Debug artifacts (raw-stream-sim, stream-debug, etc.)
 ├── cmd/                                  # Executable entrypoints
 │   ├── ds2api/                           # Main service bootstrap
 │   └── ds2api-tests/                     # E2E testsuite CLI bootstrap
@@ -25,6 +26,8 @@ ds2api/
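 │   ├── chathistory/                      # Server-side conversation history storage/query
 │   ├── claudeconv/                       # Claude message conversion helpers
 │   ├── compat/                           # Compatibility and regression helpers
 │   ├── assistantturn/                    # Upstream output to canonical assistant turn / stream event semantics
 │   ├── completionruntime/                # Shared Go main-path DeepSeek completion startup, non-stream collection, and retry
 │   ├── config/                           # Config loading/validation/hot reload
 │   ├── deepseek/                         # DeepSeek upstream client/protocol/transport
 │   │   ├── client/                       # Login/session/completion/upload/delete upstream calls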
@@ -38,13 +41,14 @@ ds2api/
 │   │   ├── admin/                        # Admin API root assembly and resource packages
 │   │   ├── claude/                       # Claude HTTP protocol adapter
 │   │   ├── gemini/                       # Gemini HTTP protocol adapter
-│   │   └── openai/                       # OpenAI HTTP surface
+│   │   ├── openai/                       # OpenAI HTTP surface
-│   │       ├── chat/                     # Chat Completions execution entrypoint
+│   │   │   ├── chat/                     # Chat Completions execution entrypoint
-│   │       ├── responses/                # Responses API and response store
+│   │   │   ├── responses/                # Responses API and response store
-│   │       ├── files/                    # Files API and inline-file preprocessing
+│   │   │   ├── files/                    # Files API and inline-file preprocessing
-│   │       ├── embeddings/               # Embeddings API
+│   │   │   ├── embeddings/               # Embeddings API
-│   │       ├── history/                  # OpenAI context file handling
+│   │   │   ├── history/                  # OpenAI context file handling
-│   │       └── shared/                   # OpenAI HTTP errors/models/tool formatting
+│   │   │   └── shared/                   # OpenAI HTTP errors/models/tool formatting
 │   │   └── requestbody/                  # HTTP body reading and UTF-8/JSON validation helpers
 │   ├── js/                               # Node runtime related logic
 │   │   ├── chat-stream/                  # Node streaming bridge
 │   │   ├── helpers/                      # JS helper modules
@@ -61,13 +65,14 @@ ds2api/
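 │   ├── textclean/                        # Text cleanup
 │   ├── toolcall/                         # Tool-call parsing and repair
 │   ├── toolstream/                       # Go streaming tool-call anti-leak and delta detection
-│   ├── translatorcliproxy/               # Cross-protocol translation bridge
+│   ├── translatorcliproxy/               # Vercel/fallback/test protocol translation bridge
 │   ├── util/                             # Shared utility helpers
 │   ├── version/                          # Version query/compare
 │   └── webui/                            # WebUI static hosting logic
 ├── plans/                                # Stage plans and manual QA records
 ├── pow/                                  # PoW standalone implementation + benchmarks
 ├── scripts/                              # Build/release/helper scripts
 ├── static/                               # Build artifacts (admin and other static resources)
 ├── tests/                                # Test assets and scripts
 │   ├── compat/                           # Compatibility fixtures + expected outputs
 │   │   ├── expected/                     # Expected output samples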
@@ -76,9 +81,9 @@ ds2api/
 │   │       └── toolcalls/                # Tool-call fixtures
 │   ├── node/                             # Node unit tests
 │   ├── raw_stream_samples/               # Upstream raw SSE samples
 │   │   ├── content-filter-trigger-20260405-jwt3/          # Content-filter terminal sample
 │   │   ├── continue-thinking-snapshot-replay-20260405/    # Continue sample
-│   │   ├── guangzhou-weather-reasoner-search-20260404/    # Search + reference sample
+│   │   ├── longtext-deepseek-v4-flash-20260429/           # Flash long-text/file-upload sample
 │   │   ├── longtext-deepseek-v4-pro-20260429/             # Pro long-text/file-upload sample
 │   │   ├── markdown-format-example-20260405/              # Markdown sample
 │   │   └── markdown-format-example-20260405-spacefix/     # Space-fix sample
 │   ├── scripts/                          # Test entry scripts
@@ -91,6 +96,8 @@ ds2api/
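        ├── features/                     # Feature modules
        │   ├── account/                  # Account management page
        │   ├── apiTester/                # API tester page
        │   ├── chatHistory/              # Server-side conversation history page
        │   ├── proxy/                    # Proxy management page
        │   ├── settings/                 # Settings page
        │   └── vercel/                   # Vercel sync page
        ├── layout/                       # Layout components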
@@ -124,8 +131,11 @@ flowchart LR
    subgraph RUNTIME[Shared runtime]
        AUTH[internal/auth]
        POOL[internal/account queue + concurrency]
        CR[internal/completionruntime]
        TURN[internal/assistantturn]
        STREAM[internal/stream + internal/sse]
        TOOL[internal/toolcall + internal/toolstream]
        FMT[internal/format/openai + claude]
        DS[internal/deepseek/client]
        POW[pow + internal/deepseek/protocol]
    end
@@ -151,16 +161,24 @@ flowchart LR
    PC --> PROMPT
    PC -.long history.-> HIST
    PC --> AUTH
    PC --> CR
    NCS -.Go prepare/release.-> CHAT
    NCS --> JS
    JS --> TOOL
    AUTH --> POOL
-    CHAT --> STREAM
+    CHAT --> CR
-    RESP --> STREAM
+    RESP --> CR
    CA --> CR
    GA --> CR
    CR --> DS
    CR --> STREAM
    CR --> TURN
    STREAM --> TURN
    STREAM --> TOOL
-    POOL --> DS
+    TURN --> FMT
    POOL --> CR
    DS --> POW
    DS --> U[DeepSeek upstream]
 ```
@@ -169,9 +187,12 @@ flowchart LR
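 - `internal/server`: router tree and middleware mounting (health checks, protocol entrypoints, Admin/WebUI).
 - `internal/httpapi/openai/*`: OpenAI HTTP surface, split into chat, responses, files, embeddings, history, and shared; chat/responses share the core promptcompat, stream, and toolcall semantics.
-- `internal/httpapi/{claude,gemini}`: protocol input/output adapters that normalize into the same prompt compatibility semantics without re-implementing the upstream call logic.
+- `internal/httpapi/{claude,gemini}`: protocol input/output adapters that normalize into the same prompt compatibility semantics; normal direct paths must share the DeepSeek session/PoW/completion calls through `completionruntime`, while `translatorcliproxy` is reserved for Vercel prepare/release, missing-backend fallback, and regression tests.
 - `internal/httpapi/requestbody`: request body reading, pre-decode JSON validation, and UTF-8 error handling helpers reused across protocols.
 - `internal/promptcompat`: compatibility core that turns OpenAI/Claude/Gemini requests into DeepSeek web-chat plain-text context.
-- `internal/translatorcliproxy`: structure translation between Claude/Gemini and OpenAI.
+- `internal/assistantturn`: Go output-side canonical semantics layer that normalizes DeepSeek SSE collection results and stream finalization state into assistant turns, centralizing thinking, tool call, citation, usage, and stop/error semantics.
 - `internal/completionruntime`: completion execution helpers shared by the Go surfaces, responsible for DeepSeek session/PoW/call startup, non-stream collection, and empty-output retry; streaming paths reuse it to start upstream requests, keep using `internal/stream` for real-time consumption, and hook into `assistantturn` during final finalization.
 - `internal/translatorcliproxy`: bridge compatibility layer for structure translation between Claude/Gemini and OpenAI; it is not the main business protocol conversion center.
 - `internal/deepseek/{client,protocol,transport}`: upstream requests, sessions, PoW adaptation, protocol constants, and the transport layer.
 - `internal/js/chat-stream` + `api/chat-stream.js`: Vercel Node streaming bridge; Go prepare/release manages auth, account leases, and the completion payload, while the Node side relays real-time SSE and keeps Go-aligned terminal state and tool sieve semantics.
 - `internal/stream` + `internal/sse`: Go stream parsing and incremental processing.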
@@ -180,6 +201,13 @@ flowchart LR
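 - `internal/chathistory`: server-side conversation history persistence, pagination, single-item detail lookup, and retention policy.
 - `internal/config`: config loading, validation, and runtime settings hot reload.
 - `internal/account`: managed account pool, concurrency slots, and wait queue.
 - `internal/textclean`: text cleanup, removing noise such as `[reference: N]` markers.
 - `internal/claudeconv`: protocol conversion from Claude API requests to the DeepSeek format.
 - `internal/compat`: compatibility regression test suite that uses SSE fixtures to verify output consistency.
 - `internal/rawsample`: capture, read/write, and management of upstream raw responses.
 - `internal/devcapture`: developer debug capture, storing HTTP requests/responses for troubleshooting.
 - `internal/util`: cross-package utilities, including JSON writing, type conversion, token counting, thinking parsing, etc.
 - `internal/version`: version query and comparison, supporting build-time injection and runtime resolution.
 ## 4. WebUI Runtime Relation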
--- a/docs/CONTRIBUTING.en.md
+++ b/docs/CONTRIBUTING.en.md
@@ -36,7 +36,7 @@ go run ./cmd/ds2api
 cd webui
 # 2. Install dependencies
-npm install
+npm ci
 # 3. Start dev server (hot reload)
 npm run dev
--- a/docs/CONTRIBUTING.md
+++ b/docs/CONTRIBUTING.md
@@ -36,7 +36,7 @@ go run ./cmd/ds2api
 cd webui
 # 2. Install dependencies
-npm install
+npm ci
 # 3. Start dev server (hot reload)
 npm run dev
--- a/docs/DEPLOY.en.md
+++ b/docs/DEPLOY.en.md
@@ -64,8 +64,8 @@ Use `config.json` as the single source of truth:
 Built-in GitHub Actions workflow: `.github/workflows/release-artifacts.yml`
-- **Trigger**: only on Release `published` (no build on normal push)
+- **Trigger**: by default only on Release `published`; you can also run it manually via `workflow_dispatch` and pass `release_tag` to rerun / backfill
-- **Outputs**: multi-platform binary archives + `sha256sums.txt`
+- **Outputs**: multi-platform binary archives, Linux Docker image export tarballs, and `sha256sums.txt`
 - **Container publishing**: GHCR only (`ghcr.io/cjackhwang/ds2api`)
 | Platform | Architecture | Format |
@@ -130,6 +130,9 @@ docker-compose logs -f
 ```
 The default `docker-compose.yml` directly uses `ghcr.io/cjackhwang/ds2api:latest` and maps host port `6011` to container port `5001`. If you want `5001` exposed directly, set `DS2API_HOST_PORT=5001` (or adjust the `ports` mapping).
 The compose template also defaults to `DS2API_CONFIG_PATH=/data/config.json` with `./config.json:/data/config.json` mounted, so deployments avoid read-only `/app` persistence issues by default.
 The image pre-creates `/data` and grants it to the non-root `ds2api` user. If you bind-mount a single host file, make sure `config.json` is readable/writable by the container user, for example with `chmod 644 config.json`; otherwise Linux UID/GID mismatches can still cause `open /data/config.json: permission denied`.
 Compatibility note: when `DS2API_CONFIG_PATH` is unset and runtime base dir is `/app`, newer versions prefer `/data/config.json`; if that file is missing but legacy `/app/config.json` exists, DS2API automatically falls back to the legacy path to avoid post-upgrade config loss.
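The lookup order in the compatibility note can be written down as a short resolver. This is a sketch of the described precedence, not the project's code: `resolveConfigPath` and the `exists` callback are hypothetical, and the branch for base dirs other than `/app` is an assumption the text does not cover.

```go
package main

import (
	"fmt"
	"os"
	"path"
)

// resolveConfigPath applies the precedence from the note:
// explicit DS2API_CONFIG_PATH, then /data/config.json, then legacy /app/config.json.
func resolveConfigPath(envPath, baseDir string, exists func(string) bool) string {
	if envPath != "" {
		return envPath
	}
	if baseDir != "/app" {
		// Assumption: outside the image, the config sits in the base dir.
		return path.Join(baseDir, "config.json")
	}
	const preferred, legacy = "/data/config.json", "/app/config.json"
	if !exists(preferred) && exists(legacy) {
		return legacy // keeps pre-upgrade configs from being lost
	}
	return preferred
}

// fileExists reports whether p can be stat'ed.
func fileExists(p string) bool {
	_, err := os.Stat(p)
	return err == nil
}

func main() {
	fmt.Println(resolveConfigPath(os.Getenv("DS2API_CONFIG_PATH"), "/app", fileExists))
}
```

Note that the legacy path is chosen only when the preferred file is missing, so once `/data/config.json` exists it always wins.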
 If you want a pinned version instead of `latest`, you can also pull a specific tag directly:
@@ -194,10 +197,46 @@ This repo includes a `zeabur.yaml` template for one-click deployment on Zeabur:
 Notes:
 - **Port**: DS2API listens on `5001` by default; the template sets `PORT=5001`.
- **Persistent config**: the template mounts `/data` and sets `DS2API_CONFIG_PATH=/data/config.json`. After importing config in Admin UI, it will be written and persisted to this path.
+- **Persistent config**: the template mounts `/data` and sets `DS2API_CONFIG_PATH=/data/config.json`. On a fresh volume, DS2API starts with an empty file-backed config; after importing config in Admin UI, it will be written and persisted to this path.
 - **`open /app/config.json: permission denied`**: this means the instance is trying to persist runtime tokens to a read-only path (commonly `/app` inside the image).  
  Recommended handling:
  1. Set a writable path explicitly: `DS2API_CONFIG_PATH=/data/config.json` (and mount a persistent volume at `/data`);
  2. If you bootstrap with `DS2API_CONFIG_JSON` and do not need runtime writeback, keep env-backed mode (`DS2API_ENV_WRITEBACK` disabled);
  3. In current versions, login/session tests continue even if persistence fails; the Admin API returns a warning that token persistence failed and that the token is held in memory only until restart.
 - **Build version**: Zeabur / regular `docker build` does not require `BUILD_VERSION` by default. The image prefers that build arg when provided, and automatically falls back to the repo-root `VERSION` file when it is absent.
 - **First login**: after deployment, open `/admin` and log in with the `DS2API_ADMIN_KEY` shown in the Zeabur env/template instructions (recommended: rotate to a strong secret after first login).
 #### Manual Deployment Without The Template
 If you do not want to use the `zeabur.yaml` one-click template, deploy directly from the repo root with Zeabur's GitHub integration:
 1. Fork this repo, or push the code to your own GitHub repository.
 2. In Zeabur Dashboard, create a Project, add a Service, then choose a GitHub/Git repository source.
 3. Select the repository and branch. Keep Root Directory as `/`.
 4. Use the Dockerfile build path. Zeabur auto-detects the repo-root `Dockerfile`; do not set `ZBPACK_IGNORE_DOCKERFILE=true`. If the UI asks for a Dockerfile name, enter `Dockerfile`.
 5. Add a persistent volume in the Service settings and mount it at `/data`.
 6. Configure environment variables:
 | Variable | Recommended value | Description |
 | --- | --- | --- |
 | `PORT` | `5001` | Service listen port; keep it aligned with the exposed Zeabur HTTP port. |
 | `DS2API_ADMIN_KEY` | Strong random string | Required admin login key. |
 | `DS2API_CONFIG_PATH` | `/data/config.json` | Recommended persistent config path. |
 | `LOG_LEVEL` | `INFO` | Optional log level. |
 | `DS2API_CONFIG_JSON` | Raw JSON or Base64 JSON | Optional config bootstrap from env. |
 | `DS2API_ENV_WRITEBACK` | `1` | Optional; enable only when using `DS2API_CONFIG_JSON` and you want the initial config written to `/data/config.json`. |
 7. Expose HTTP port `5001`. The health check path can be `/healthz`.
 8. After deployment, open `/admin`, log in with `DS2API_ADMIN_KEY`, then import or edit config in Admin UI. A fresh volume does not need `/data/config.json` up front; the service boots first and creates the file on the first save.
 Troubleshooting:
 - **Startup log says `open /data/config.json: no such file or directory`**: the deployed version probably predates the fresh-volume bootstrap fix; redeploy from the latest code.
 - **`open /app/config.json: permission denied`**: the config path still points at the read-only image directory; mount `/data` and set `DS2API_CONFIG_PATH=/data/config.json`.
 - **Config disappears after restart**: check that the `/data` persistent volume is mounted on this service. If you use `DS2API_CONFIG_JSON` but want Admin UI saves persisted, enable `DS2API_ENV_WRITEBACK=1`.
 References: Zeabur's official [GitHub/Git integration](https://zeabur.com/docs/en-US/deploy/github), [Dockerfile deployment](https://zeabur.com/docs/en-US/deploy/dockerfile), and [Volumes](https://zeabur.com/docs/data-management/volumes) docs.
 ---
 ## 3. Vercel Deployment
@@ -263,6 +302,7 @@ VERCEL_TEAM_ID=team_xxxxxxxxxxxx   # optional for personal accounts
 | `VERCEL_TOKEN` | Vercel sync token | — |
 | `VERCEL_PROJECT_ID` | Vercel project ID | — |
 | `VERCEL_TEAM_ID` | Vercel team ID | — |
 | `DS2API_CHAT_HISTORY_PATH` | Chat history storage path (must be set to `/tmp/chat_history.json` on Vercel, otherwise unavailable due to read-only filesystem) | `data/chat_history.json` |
 | `DS2API_VERCEL_PROTECTION_BYPASS` | Deployment protection bypass for internal Node→Go calls | — |
 ### 3.4 Vercel Architecture
@@ -352,6 +392,22 @@ If API responses return Vercel HTML `Authentication Required`:
 - **Option B**: Add `x-vercel-protection-bypass` header to requests
 - **Option C**: Set `VERCEL_AUTOMATION_BYPASS_SECRET` (or `DS2API_VERCEL_PROTECTION_BYPASS`) for internal Node→Go calls
 #### Chat History Unavailable (read-only file system)
 ```text
 create chat history dir: mkdir /var/task/data: read-only file system
 ```
 **Cause**: Vercel Serverless functions have a read-only filesystem (`/var/task`). Chat history fails because it cannot create directories there.
 **Fix**: Add the following in Vercel Project Settings → Environment Variables:
 ```text
 DS2API_CHAT_HISTORY_PATH=/tmp/chat_history.json
 ```
 `/tmp` is the only writable directory in Vercel Serverless. Data is ephemeral (not persisted across cold starts), but the feature works within a single instance lifetime.
 ### 3.6 Build Artifacts Not Committed
 - `static/admin` directory is not in Git
@@ -394,7 +450,7 @@ Or step by step:
 ```bash
 cd webui
-npm install
+npm ci
 npm run build
 # Output goes to static/admin/
 ```
--- a/docs/DEPLOY.md
+++ b/docs/DEPLOY.md
@@ -64,8 +64,8 @@ cp config.example.json config.json
 仓库内置 GitHub Actions 工作流：`.github/workflows/release-artifacts.yml`
- **触发条件**：仅在 Release `published` 时触发（普通 push 不会构建）
+- **触发条件**：默认仅在 Release `published` 时自动触发；也支持在 Actions 页面手动 `workflow_dispatch`，并填写 `release_tag` 复跑/补发
- **构建产物**：多平台二进制压缩包 + `sha256sums.txt`
+- **构建产物**：多平台二进制压缩包、Linux Docker 镜像导出包 + `sha256sums.txt`
 - **容器镜像发布**：仅发布到 GHCR（`ghcr.io/cjackhwang/ds2api`）
 | 平台 | 架构 | 文件格式 |
@@ -130,6 +130,9 @@ docker-compose logs -f
 ```
 默认 `docker-compose.yml` 直接使用 `ghcr.io/cjackhwang/ds2api:latest`，并把宿主机 `6011` 映射到容器内的 `5001`。如果你希望直接对外暴露 `5001`，请设置 `DS2API_HOST_PORT=5001`（或者手动调整 `ports` 配置）。
 Compose 模板还会默认设置 `DS2API_CONFIG_PATH=/data/config.json` 并挂载 `./config.json:/data/config.json`，优先避免 `/app` 只读带来的配置持久化问题。
 镜像内会预创建 `/data` 并授权给非 root 的 `ds2api` 用户；如果你使用 bind mount 单文件，请确保宿主机 `config.json` 至少可被容器用户读取/写入，例如 `chmod 644 config.json`，否则 Linux UID/GID 不一致时仍可能出现 `open /data/config.json: permission denied`。
 兼容说明：若未设置 `DS2API_CONFIG_PATH` 且运行目录是 `/app`，新版本会优先使用 `/data/config.json`；当该文件不存在但检测到历史 `/app/config.json` 时，会自动回退读取旧路径，避免升级后“配置丢失”。
 如需固定版本，也可以直接拉取指定 tag：
@@ -194,10 +197,46 @@ healthcheck:
 部署要点：
 - **端口**：服务默认监听 `5001`，模板会固定设置 `PORT=5001`。
- **配置持久化**：模板挂载卷 `/data`，并设置 `DS2API_CONFIG_PATH=/data/config.json`；在管理台导入配置后，会写入并持久化到该路径。
+- **配置持久化**：模板挂载卷 `/data`，并设置 `DS2API_CONFIG_PATH=/data/config.json`；首次空卷启动时会先使用空的文件模式配置，在管理台导入配置后，会写入并持久化到该路径。
 - **`open /app/config.json: permission denied`**：说明当前实例在尝试把运行时 token 持久化到只读路径（常见于镜像内 `/app`）。  
  处理建议：
  1. 显式设置可写路径：`DS2API_CONFIG_PATH=/data/config.json`（并挂载持久卷到 `/data`）；  
  2. 若你使用 `DS2API_CONFIG_JSON` 启动且不需要运行时落盘，可保持环境变量模式（`DS2API_ENV_WRITEBACK` 关闭）；  
  3. 最新版本中，即使持久化失败，登录/会话测试仍会继续，仅提示“token 未持久化（重启后丢失）”。
 - **构建版本号**：Zeabur / 普通 `docker build` 默认不需要传 `BUILD_VERSION`；镜像会优先使用该构建参数，未提供时自动回退到仓库根目录的 `VERSION` 文件。
 - **首次登录**：部署完成后访问 `/admin`，使用 Zeabur 环境变量/模板指引中的 `DS2API_ADMIN_KEY` 登录（建议首次登录后自行更换为强密码）。
 #### 不使用模板手动部署
 如果你不想使用 `zeabur.yaml` 一键模板，可以直接用 Zeabur 的 GitHub 集成从仓库根目录构建：
 1. Fork 本仓库，或把代码推送到你自己的 GitHub 仓库。
 2. 在 Zeabur Dashboard 中创建 Project，然后添加 Service，选择 GitHub/Git 仓库来源。
 3. 选择仓库与分支，Root Directory 保持 `/`。
 4. 构建方式使用 Dockerfile。Zeabur 会自动检测仓库根目录的 `Dockerfile`；不要设置 `ZBPACK_IGNORE_DOCKERFILE=true`。如果界面要求填写 Dockerfile 名称，填写 `Dockerfile`。
 5. 在 Service 配置中添加持久卷，挂载目录填写 `/data`。
 6. 配置环境变量：
 | 变量 | 推荐值 | 说明 |
 | --- | --- | --- |
 | `PORT` | `5001` | 服务监听端口，需要和 Zeabur 暴露的 HTTP 端口一致。 |
 | `DS2API_ADMIN_KEY` | 强随机字符串 | 管理台登录密钥，必填。 |
 | `DS2API_CONFIG_PATH` | `/data/config.json` | 配置持久化路径，建议必填。 |
 | `LOG_LEVEL` | `INFO` | 可选，日志级别。 |
 | `DS2API_CONFIG_JSON` | 原始 JSON 或 Base64 JSON | 可选，用于用环境变量初始化配置。 |
 | `DS2API_ENV_WRITEBACK` | `1` | 可选；当设置了 `DS2API_CONFIG_JSON` 且希望首次启动后写入 `/data/config.json` 时再启用。 |
 7. 暴露 HTTP 端口 `5001`，健康检查路径可填 `/healthz`。
 8. 部署完成后访问 `/admin`，用 `DS2API_ADMIN_KEY` 登录，然后在管理台导入或编辑配置。首次空卷可以没有 `/data/config.json`，服务会先启动，第一次保存时自动创建该文件。
 常见问题：
 - **启动日志出现 `open /data/config.json: no such file or directory`**：请确认已经部署包含“首次空卷启动”修复的版本，并重新部署最新代码。
 - **出现 `open /app/config.json: permission denied`**：说明配置路径仍指向镜像内只读目录；设置持久卷 `/data`，并确认 `DS2API_CONFIG_PATH=/data/config.json`。
 - **管理台保存后重启配置丢失**：检查 `/data` 持久卷是否已挂载到当前服务；如果使用了 `DS2API_CONFIG_JSON`，但想让管理台保存落盘，请启用 `DS2API_ENV_WRITEBACK=1`。
 参考：Zeabur 官方文档的 [GitHub/Git 集成](https://zeabur.com/docs/en-US/deploy/github)、[Dockerfile 部署](https://zeabur.com/docs/zh-CN/deploy/dockerfile) 与 [Volumes](https://zeabur.com/docs/data-management/volumes)。
 ---
 ## 三、Vercel 部署
@@ -263,6 +302,7 @@ VERCEL_TEAM_ID=team_xxxxxxxxxxxx   # 个人账号可留空
 | `VERCEL_TOKEN` | Vercel 同步 token | — |
 | `VERCEL_PROJECT_ID` | Vercel 项目 ID | — |
 | `VERCEL_TEAM_ID` | Vercel 团队 ID | — |
 | `DS2API_CHAT_HISTORY_PATH` | Chat history 存储路径（Vercel 上必须设为 `/tmp/chat_history.json`，否则因文件系统只读而不可用） | `data/chat_history.json` |
 | `DS2API_VERCEL_PROTECTION_BYPASS` | 部署保护绕过密钥（内部 Node→Go 调用） | — |
 ### 3.3 运行时行为配置（通过 Admin API 设置）
@@ -362,6 +402,22 @@ No Output Directory named "public" found after the Build completed.
 - **方案 B**：请求中添加 `x-vercel-protection-bypass` 头
 - **方案 C**：设置 `VERCEL_AUTOMATION_BYPASS_SECRET`（或 `DS2API_VERCEL_PROTECTION_BYPASS`），仅影响内部 Node→Go 调用
 #### Chat History 不可用（read-only file system）
 ```text
 create chat history dir: mkdir /var/task/data: read-only file system
 ```
 **原因**：Vercel Serverless 函数的文件系统（`/var/task`）为只读，chat history 尝试在该路径下创建目录失败。
 **解决**：在 Vercel Project Settings → Environment Variables 中添加：
 ```text
 DS2API_CHAT_HISTORY_PATH=/tmp/chat_history.json
 ```
 `/tmp` 是 Vercel Serverless 环境中唯一可写的目录。数据在函数冷启动之间不会持久化（ephemeral），但在单个实例生命周期内功能正常。
 ### 3.6 仓库不提交构建产物
 - `static/admin` 目录不在 Git 中
@@ -404,7 +460,7 @@ go run ./cmd/ds2api
 ```bash
 cd webui
-npm install
+npm ci
 npm run build
 # 产物输出到 static/admin/
 ```
--- a/docs/DEVELOPMENT.md
+++ b/docs/DEVELOPMENT.md
@@ -68,12 +68,13 @@ gofmt -w <changed-go-files>
 3. Request normalization: `internal/promptcompat` or the protocol conversion packages.
 4. Upstream requests: `internal/deepseek/client`.
 5. Streaming output: `internal/stream`, `internal/sse`, `internal/toolstream`.
-6. Response format: `internal/format/*` or `internal/translatorcliproxy`.
+6. Response format: for the main path see `internal/assistantturn` and `internal/format/*`; `internal/translatorcliproxy` is used only for Vercel/fallback/test bridging.
 对话记录页面问题优先检查：
 - Admin API：`/admin/chat-history`、`/admin/chat-history/{id}`。
 - 后端存储：`internal/chathistory/store.go`。
 - 输出归档：`internal/responsehistory` 在协议回译/裁剪前记录 DeepSeek 上游 assistant text / thinking；即使工具调用已被对外响应转成结构化 `tool_calls` 并从可见正文剔除，后台历史仍应保留原始 DSML / XML 片段，方便排查格式漂移。
 - 前端轮询和 ETag：`webui/src/features/chatHistory/ChatHistoryContainer.jsx`。
 Tool call 问题优先跑：
--- a/docs/DeepSeekSSE行为结构说明-2026-04-05.md
+++ b/docs/DeepSeekSSE行为结构说明-2026-04-05.md
@@ -1,7 +1,7 @@
 # DeepSeek SSE 行为结构说明（第三方逆向版）
 > 说明：本文基于当前仓库 `tests/raw_stream_samples/` 下全部 `upstream.stream.sse` 原始流样本整理而成，属于第三方逆向观察文档，不是官方协议。
-> 当前 corpus 由 4 份原始流组成，覆盖搜索+引用、风控终态、Markdown 输出和空格敏感输出等行为。
+> 当前 corpus 由 5 份原始流组成，覆盖长文本生成、文件上传上下文、continue 接续、Markdown 输出和空格敏感输出等行为。
 > 补充：文末还会注明少量“当前实现已确认、但 corpus 尚未完整覆盖”的行为，例如长思考场景下的自动续写状态。
 文档导航：[文档总索引](./README.md) / [测试指南](./TESTING.md) / [样本目录说明](../tests/raw_stream_samples/README.md)
@@ -12,8 +12,9 @@
 | 样本 | 观察重点 |
 | --- | --- |
-| [guangzhou-weather-reasoner-search-20260404](../tests/raw_stream_samples/guangzhou-weather-reasoner-search-20260404/upstream.stream.sse) | 搜索+思考流程，包含 `reference:N` 引用标记与工具片段 |
+| [longtext-deepseek-v4-flash-20260429](../tests/raw_stream_samples/longtext-deepseek-v4-flash-20260429/upstream.stream.sse) | DeepSeek V4 flash 长文本流，包含 current input file 上传后的 completion 样本 |
-| [content-filter-trigger-20260405-jwt3](../tests/raw_stream_samples/content-filter-trigger-20260405-jwt3/upstream.stream.sse) | `CONTENT_FILTER` 终态分支，包含拒答模板与 `ban_regenerate` |
+| [longtext-deepseek-v4-pro-20260429](../tests/raw_stream_samples/longtext-deepseek-v4-pro-20260429/upstream.stream.sse) | DeepSeek V4 pro 长文本流，包含文件上传上下文和较长 reasoning/content 输出 |
 | [continue-thinking-snapshot-replay-20260405](../tests/raw_stream_samples/continue-thinking-snapshot-replay-20260405/upstream.stream.sse) | 多轮 `completion + continue` 原始流，用于验证接续思考去重 |
 | [markdown-format-example-20260405](../tests/raw_stream_samples/markdown-format-example-20260405/upstream.stream.sse) | Markdown 输出的早期样本，用于观察 token 级输出形态 |
 | [markdown-format-example-20260405-spacefix](../tests/raw_stream_samples/markdown-format-example-20260405-spacefix/upstream.stream.sse) | Markdown 输出修正样本，用于验证空格 chunk 必须保留 |
@@ -194,7 +195,7 @@ close
 ## 8. 终态行为
-当前 corpus 里有两条很重要的终态分支。
+当前 corpus 直接覆盖正常完成和 continue 接续；当前实现还兼容 `CONTENT_FILTER` 风控终态，相关分支由协议观察与兼容性 fixture 继续守护。
 ### 8.1 正常完成
@@ -208,7 +209,7 @@ close
 ### 8.2 风控终态
-`content-filter-trigger-20260405-jwt3` 展示了另一种终态路径：
+`CONTENT_FILTER` 不在当前 raw stream corpus 的目录样本中，但代码和兼容性测试仍按下面这种终态路径处理：
 1. 先继续输出一段正常正文。
 2. 出现提示类 fragment，例如 `TIP`。
@@ -309,7 +310,18 @@ parse SSE block
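 - New models may add new `p` paths.
 - New versions may add new `fragment.type` values.
 - The terminal-state template content for `CONTENT_FILTER` may change.
- Auto-continue states (such as `INCOMPLETE` / `AUTO_CONTINUE`) currently come mainly from live observation and the implementation's compatibility logic; their field shapes may still change.
+- Auto-continue states (such as `INCOMPLETE` / `AUTO_CONTINUE`) currently come mainly from live observation and the implementation's compatibility logic; their field shapes may still change. The current implementation does not auto-continue merely because of an early `WIP` status; only an explicit `INCOMPLETE` or `auto_continue` signal triggers a continue.
 - Parsers should tolerate unknown fields, unknown paths, and unknown events.
 If you use these notes for real development, keep the raw stream samples, replay scripts, and regression tests alongside them rather than relying on this document alone.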
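The auto-continue rule in the list above (an early `WIP` status alone never triggers a continue) can be sketched as a small predicate. `shouldAutoContinue` is a hypothetical name, and modelling the `auto_continue` signal as a boolean is an assumption; the real field shape may differ.

```go
package main

import "fmt"

// shouldAutoContinue reports whether a continue round should be issued.
// Assumption: status is the last observed status string and autoContinue
// is the explicit auto_continue signal decoded from the stream.
func shouldAutoContinue(status string, autoContinue bool) bool {
	if autoContinue {
		return true
	}
	// WIP and any unknown status deliberately fall through to false.
	return status == "INCOMPLETE"
}

func main() {
	fmt.Println(shouldAutoContinue("WIP", false))
	fmt.Println(shouldAutoContinue("INCOMPLETE", false))
}
```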
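 ## 2026-04-29: Incremental observations from recent production samples
 Based on the two real-account long-text samples `longtext-deepseek-v4-flash-20260429` and `longtext-deepseek-v4-pro-20260429`, the key recent format changes are:
 1. `data:` events still frequently carry pathless `{"v":"..."}` increments (`p` absent); the parser must treat an empty path as a candidate for visible content rather than relying only on `response/content`.
 2. Object-shaped `v` values (such as `{"text":"..."}` / `{"content":"..."}`) still appear and may be mixed with pathless chunks; handling `v` only as a string drops chunks of content.
 3. In multi-round continuation scenarios, later chunks may no longer repeat an explicit `status`; the state machine must retain the previous round's `INCOMPLETE` semantics until a terminal state appears.
 4. From 2026-04-29 the client header version baseline was raised to `x-client-version: 2.0.3`; otherwise some accounts see inconsistent upstream behavior (including empty output and continuation-round anomalies).
 Recommendation: default replay for newly added samples should prioritize the combination of long text + multiple rounds + pathless chunks, so that short-only samples do not let regressions slip through.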
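Observations 1 and 2 above can be illustrated with a decoder that accepts both pathless chunks and object-shaped `v` values. This is a simplified sketch, not the repository's parser: `visibleText` and the `chunk` type are hypothetical, and it treats every pathless chunk as visible content, whereas a fuller parser would presumably also track which path the preceding chunk belonged to.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// chunk models one decoded data: payload; V is kept raw because it may be
// either a JSON string or a JSON object.
type chunk struct {
	P string          `json:"p"`
	V json.RawMessage `json:"v"`
}

// visibleText returns the visible text carried by one payload, if any.
func visibleText(data []byte) (string, bool) {
	var c chunk
	if err := json.Unmarshal(data, &c); err != nil {
		return "", false
	}
	if len(c.V) == 0 || string(c.V) == "null" {
		return "", false
	}
	// An empty path counts as a content candidate, same as response/content.
	if c.P != "" && c.P != "response/content" {
		return "", false
	}
	var s string
	if err := json.Unmarshal(c.V, &s); err == nil {
		return s, true
	}
	var obj struct {
		Text    *string `json:"text"`
		Content *string `json:"content"`
	}
	if err := json.Unmarshal(c.V, &obj); err != nil {
		return "", false
	}
	switch {
	case obj.Text != nil:
		return *obj.Text, true
	case obj.Content != nil:
		return *obj.Content, true
	}
	return "", false
}

func main() {
	for _, raw := range []string{`{"v":"Hel"}`, `{"v":{"text":"lo"}}`, `{"p":"response/status","v":"FINISHED"}`} {
		text, ok := visibleText([]byte(raw))
		fmt.Printf("%q %v\n", text, ok)
	}
}
```

Pointer fields distinguish an absent key from an empty string, so a whitespace-only chunk such as `{"content":" "}` is still reported as visible text.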
--- a/docs/README.md
+++ b/docs/README.md
@@ -16,13 +16,14 @@
 ### 专题文档
 - [DS2API 项目价值说明](./project-value.md)
 - [API -> 网页对话纯文本兼容主链路说明](./prompt-compatibility.md)
 - [Tool Calling 统一语义](./toolcall-semantics.md)
 - [DeepSeek SSE 行为结构说明（逆向观察）](./DeepSeekSSE行为结构说明-2026-04-05.md)
 ### 文档维护约定
- 文档更新必须以实际代码实现为依据：总路由装配看 `internal/server/router.go`，协议/resource 路由看 `internal/httpapi/*/**/routes.go` 与 `internal/httpapi/admin/handler.go`，配置默认值看 `internal/config/*`，模型/alias 看 `internal/config/models.go`，prompt 兼容链路看 `docs/prompt-compatibility.md` 列出的代码入口。
+- 文档更新必须以实际代码实现为依据：总路由装配看 `internal/server/router.go`，协议/resource 路由看 `internal/httpapi/**/handler*.go` 与 `internal/httpapi/admin/handler.go`，配置默认值看 `internal/config/*`，模型/alias 看 `internal/config/models.go`，prompt 兼容链路看 `docs/prompt-compatibility.md` 列出的代码入口。
 - `README.MD` / `README.en.md`：面向首次接触用户，保留“是什么 + 怎么快速跑起来”。
 - `docs/ARCHITECTURE*.md`：面向开发者，集中维护项目结构、模块职责与调用链。
 - `API*.md`：面向客户端接入者，聚焦接口行为、鉴权和示例。
@@ -47,13 +48,14 @@ Recommended reading order:
 ### Topical docs
 - [DS2API project value note](./project-value.md)
 - [API -> pure-text web-chat compatibility pipeline](./prompt-compatibility.md)
 - [Tool-calling unified semantics](./toolcall-semantics.md)
 - [DeepSeek SSE behavior notes (reverse-engineered)](./DeepSeekSSE行为结构说明-2026-04-05.md)
 ### Maintenance conventions
- Documentation updates must be grounded in the actual implementation: root routing lives in `internal/server/router.go`, protocol/resource routes live in `internal/httpapi/*/**/routes.go` and `internal/httpapi/admin/handler.go`, config defaults in `internal/config/*`, models/aliases in `internal/config/models.go`, and the prompt compatibility pipeline in the code entrypoints listed by `docs/prompt-compatibility.md`.
+- Documentation updates must be grounded in the actual implementation: root routing lives in `internal/server/router.go`, protocol/resource routes live in `internal/httpapi/**/handler*.go` and `internal/httpapi/admin/handler.go`, config defaults in `internal/config/*`, models/aliases in `internal/config/models.go`, and the prompt compatibility pipeline in the code entrypoints listed by `docs/prompt-compatibility.md`.
 - `README.MD` / `README.en.md`: onboarding-oriented (“what + quick start”).
 - `docs/ARCHITECTURE*.md`: developer-oriented source of truth for module boundaries and execution flow.
 - `API*.md`: integration-oriented behavior/contracts.
--- a/docs/TESTING.md
+++ b/docs/TESTING.md
@@ -60,11 +60,10 @@ npm run build --prefix webui
 ./tests/scripts/check-refactor-line-gate.sh
 ./tests/scripts/check-node-split-syntax.sh
 ./tests/scripts/check-cross-build.sh
 # 历史阶段门禁：阶段 6 手工烟测签字检查（默认读取 plans/stage6-manual-smoke.md）
 ./tests/scripts/check-stage6-manual-smoke.sh
 ```
 说明：`plans/stage6-manual-smoke.md` 已移除，阶段 6 手工烟测不再作为当前 CI 或发布门禁。
 ### 端到端测试 | End-to-End Tests
 ```bash
--- a/docs/project-value.md
+++ b/docs/project-value.md
@@ -0,0 +1,119 @@
 # DS2API 项目价值说明
 文档导航：[总览](../README.MD) / [文档索引](./README.md) / [接口文档](../API.md) / [兼容主链路](./prompt-compatibility.md) / [Tool Calling 语义](./toolcall-semantics.md)
 > 本文用于说明 DS2API 的项目定位与长期价值。
 > 它不是架构说明，也不是功能清单，而是从“网页能力如何稳定 API 化”这个角度解释本项目为什么成立。
 ## 1. 项目定位
 DS2API 的定位不是“又一个 API 代理”，也不是训练工具。
 它本质上是一个网页转 API 的兼容层：把 DeepSeek 网页对话侧可用的能力，整理成 OpenAI / Claude / Gemini 风格客户端可以接入的请求与响应形态。
 本项目的核心价值在于：
 1. 把 DeepSeek 网页对话能力 API 化。
 2. 把不同客户端协议统一到同一套兼容入口。
 3. 把网页侧会话、thinking、文件引用、流式输出等行为整理成客户端可消费的结果。
 4. 为上层编程工具、自动化工具或外部编排器提供稳定后端。
 ## 2. 解决的问题
 ### 2.1 把网页能力变成可接入的 API 形态
 网页侧能力可以直接对话，但标准客户端需要的是稳定的 API 契约。两者之间有一段天然差距：
 - 输入格式不同
 - 输出事件不同
 - 流式语义不同
 - 文件引用方式不同
 - thinking 与正文的暴露方式不同
 DS2API 通过 `promptcompat`、`completionruntime`、`assistantturn` 和各协议 renderer，把这段差距收敛到一条可维护的主链路中：
 - 请求侧把 OpenAI / Claude / Gemini 消息归一成网页纯文本上下文。
 - 上游侧按 DeepSeek 网页 completion 需要的 payload 发起会话。
 - 输出侧把 DeepSeek SSE 收集或流式事件再渲染回各协议原生形态。
 这才是本项目的主定义：把网页能力稳定转成 API 可消费形态。
 ### 2.2 不只是转发，而是兼容
 普通转发只能把请求送出去，无法处理协议语义之间的差异。DS2API 需要额外处理：
 - 模型 alias 与 DeepSeek 原生模型的映射
 - thinking / reasoning 开关与输出结构
 - search 与 citation / reference 标记
 - 文件上传、历史文件和 current input file
 - 上游空输出、content filter、auto-continue、重试和 usage 估算
 这些都不是“把 URL 改一下”能解决的事情。项目价值正是在这些细节里体现出来。
 ### 2.3 让外部工具链能挂上去
 当用户把 DS2API 接到编程工具、自动化工具或第三方 SDK 时，很多请求会变成长链路任务：
 - 读取文件
 - 搜索上下文
 - 修改代码
 - 执行命令
 - 继续修正
 - 输出最终结果
 DS2API 不直接定义这些外部工具链，但它提供了一个足够稳定的 API 底座，让这些工具链可以外挂在上面继续工作。
 ## 3. 工具调用的价值
 工具调用不是 DS2API 成立的前提，但它是项目很重要的增强能力。
 即使没有工具调用，DS2API 仍然是网页转 API 兼容层；当请求包含工具能力时，项目会额外处理模型输出漂移、长参数和流式防泄漏等问题：
 - 长脚本用 CDATA 保住原文
 - 文件路径和命令参数不容易被转义打坏
 - tool call 语法有统一的 DSML / canonical XML 处理
 - 模型输出漂了也能宽匹配、自修正
 - 流式场景能尽量不把工具块漏回普通文本
 这使 DS2API 可以服务编程工具和 agent 类客户端，但项目主轴仍然是“网页能力 API 化”，不是把工具调用当作项目唯一卖点。
 ## 4. CDATA 的作用
 CDATA 不是项目价值本身，但它是工具调用与长文本兼容中很实用的一部分。
 对本项目这种场景来说，CDATA 的作用很直接：
 - 保护长文本不被转义破坏
 - 保住脚本、命令、代码片段的原样性
 - 让结构化参数和自由文本更稳定地共存
 - 让历史内容更容易被原样回放和再处理
 它的意义不是让协议显得更复杂，而是让内容更少在转写、解析和回放过程中坏掉。
 ## 5. 它不是什么
 为了避免误解，需要明确项目边界：
 - 不是官方 DeepSeek API。
 - 不是训练平台。
 - 不是人工标注系统。
 - 不是独立评测工具。
 - 不是简单反代。
 DS2API 是兼容层。它的职责是把网页能力整理成 API 体验，并在必要时对工具、历史、文件和流式输出做兼容处理。
 ## 6. 长期价值
 DS2API 的长期价值，不在某个单点功能，而在于它把多个难点放进了同一条可维护链路：
 - 多协议入口
 - DeepSeek 网页 completion 适配
 - prompt 纯文本兼容
 - thinking / search / file 引用处理
 - Go / Node 流式输出对齐
 - tool call 解析与防泄漏
 - Admin / WebUI / 账号池 / 并发队列
 如果要用一句话概括它的价值，可以写成：
 **DS2API 的价值，是把 DeepSeek 网页能力稳定整理成标准客户端可以持续使用的 API 形态。**
--- a/docs/prompt-compatibility.md
+++ b/docs/prompt-compatibility.md
@@ -3,7 +3,7 @@
 文档导航：[总览](../README.MD) / [架构说明](./ARCHITECTURE.md) / [接口文档](../API.md) / [测试指南](./TESTING.md)
 > 本文档是 DS2API“把 OpenAI / Claude / Gemini 风格 API 请求兼容成 DeepSeek 网页对话纯文本上下文”的专项说明。
-> 这是项目最重要的兼容产物之一。凡是修改消息标准化、tool prompt 注入、tool history 保留、文件引用、current input file / legacy history_split、下游 completion payload 组装等行为，都必须同步更新本文档。
+> 这是项目最重要的兼容产物之一。凡是修改消息标准化、tool prompt 注入、tool history 保留、文件引用、current input file、下游 completion payload 组装等行为，都必须同步更新本文档。
 ## 1. 核心结论
@@ -45,9 +45,12 @@ DS2API 当前的核心思路，不是把客户端传来的 `messages`、`tools`
  -> promptcompat 统一消息标准化
  -> tool prompt 注入
  -> DeepSeek 风格 prompt 拼装
-  -> 文件收集 / inline 上传 / current input file（OpenAI 链路）
+  -> 文件收集 / inline 上传（OpenAI 文件链路）
  -> current input file（completion runtime 全局入口）
  -> completion payload
  -> 下游网页对话接口
  -> assistantturn 输出语义归一（Go 非流式 + 流式收尾）
  -> 各协议 renderer（OpenAI / Responses / Claude / Gemini）
 ```
 对应的关键代码入口：
@@ -72,6 +75,10 @@ DS2API 当前的核心思路，不是把客户端传来的 `messages`、`tools`
  [internal/promptcompat/thinking_injection.go](../internal/promptcompat/thinking_injection.go)
 - completion payload：
  [internal/promptcompat/standard_request.go](../internal/promptcompat/standard_request.go)
 - Go 输出侧 assistant turn：
  [internal/assistantturn/turn.go](../internal/assistantturn/turn.go)
 - Go completion runtime：
  [internal/completionruntime/nonstream.go](../internal/completionruntime/nonstream.go)
 ## 4. 下游真正收到的东西
@@ -98,13 +105,18 @@ DS2API 当前的核心思路，不是把客户端传来的 `messages`、`tools`
 - `prompt` 才是对话上下文主载体。
 - `ref_file_ids` 只承载文件引用，不承载普通文本消息。
 - `tools` 不会作为“原生工具 schema”直接下发给下游，而是被改写进 `prompt`。
 - 对外返回给客户端的 `prompt_tokens` / `input_tokens` / `promptTokenCount` 不再按“最后一条消息”或字符粗估近似返回，而是基于**完整上下文 prompt**做 tokenizer 计数；为了避免上下文实际超限但客户端误以为还能塞下，请求侧上下文 token 会额外保守上浮一点，宁可略大也不低估。
 - 当前 `/v1/chat/completions` 业务路径仍是“每次请求新建一个远端 `chat_session_id`，并默认发送 `parent_message_id: null`”；因此 DS2API 对外默认表现为“新会话 + prompt 拼历史”，而不是复用 DeepSeek 原生会话树。
 - 但 DeepSeek 远端本身支持同一 `chat_session_id` 的跨轮次持续对话。2026-04-27 已用项目内现有 DeepSeek client 做过一次不改业务代码的双轮实测：同一 `chat_session_id` 下，第 1 轮返回 `request_message_id=1` / `response_message_id=2` / 文本 `SESSION_TEST_ONE`；第 2 轮重新获取一次 PoW，并发送 `parent_message_id=2` 后，成功返回 `request_message_id=3` / `response_message_id=4` / 文本 `SESSION_TEST_TWO`。这说明“同远端会话持续聊天”能力存在，且每轮需要携带正确的 parent/message 链接信息，同时重新获取对应轮次可用的 PoW。
- OpenAI Chat / Responses 原生走统一 OpenAI 标准化与 DeepSeek payload 组装；Claude / Gemini 会尽量复用 OpenAI prompt/tool 语义，其中 Gemini 直接复用 `promptcompat.BuildOpenAIPromptForAdapter`，Claude 消息接口在可代理场景会转换为 OpenAI chat 形态再执行。
+- OpenAI Chat / Responses 原生走统一 OpenAI 标准化与 DeepSeek payload 组装；Claude / Gemini 会尽量复用 OpenAI prompt/tool 语义，其中 Gemini 直接复用 `promptcompat.BuildOpenAIPromptForAdapter`。Go 主服务新增 `completionruntime` 启动层，统一执行 DeepSeek session/PoW/call；输出侧新增 `assistantturn` 语义层：非流式 OpenAI Chat / Responses / Claude / Gemini 会把 DeepSeek SSE 收集结果先归一成同一份 assistant turn，再分别渲染成各协议原生外形；流式 OpenAI Chat / Responses / Claude / Gemini 继续保持各协议实时 SSE framing，但最终收尾的 tool fallback、schema 归一、usage、empty-output / content-filter 错误语义同样由 `assistantturn` 判定。Claude / Gemini 的常规 Go 主路径不再依赖内部 `httptest` 转发到 OpenAI handler；`translatorcliproxy` 仅保留用于 Vercel bridge、后端缺失 fallback 和回归测试，不作为主业务协议转换中心。
- 客户端传入的 thinking / reasoning 开关会被归一到下游 `thinking_enabled`。Gemini `generationConfig.thinkingConfig.thinkingBudget` 会翻译成同一套 thinking 开关；关闭时即使上游返回 `response/thinking_content`，兼容层也不会把它当作可见正文输出。若最终解析出的模型名带 `-nothinking` 后缀，则会无条件强制关闭 thinking，优先级高于请求体中的 `thinking` / `reasoning` / `reasoning_effort`。Claude surface 在流式请求且未显式声明 `thinking` 时，仍按 Anthropic 语义默认关闭；但在非流式代理场景，兼容层会内部开启一次下游 thinking，用于捕获“正文为空、工具调用落在 thinking 里”的情况，随后在回包前剥离用户不可见的 thinking block。
+- Vercel Node 流式路径本轮不迁移，仍使用现有 Node bridge / stream-tool-sieve 实现；后续若变更 Node 流式语义，需要按 `assistantturn` 的 Go canonical 输出语义同步对齐。
- 对 OpenAI Chat / Responses 的非流式收尾，如果最终可见正文为空，兼容层会优先尝试把思维链中的独立 DSML / XML 工具块当作真实工具调用解析出来。流式链路也会在收尾阶段做同样的 fallback 检测，但不会因为思维链内容去中途拦截或改写流式输出；thinking / reasoning 增量仍按原样先发，只有在结束收尾时才可能补发最终工具调用结果。补发结果会作为本轮 assistant 的结构化 `tool_calls` / `function_call` 输出返回，而不是塞进 `content` 文本；如果客户端没有开启 thinking / reasoning，思维链只用于检测，不会作为 `reasoning_content` 或可见正文暴露。只有正文为空且思维链里也没有可执行工具调用时，才继续按空回复错误处理。
+- 客户端传入的 thinking / reasoning 开关会被归一到下游 `thinking_enabled`。Gemini `generationConfig.thinkingConfig.thinkingBudget` 会翻译成同一套 thinking 开关；关闭时即使上游返回 `response/thinking_content`，兼容层也不会把它当作可见正文输出。若最终解析出的模型名带 `-nothinking` 后缀，则会无条件强制关闭 thinking，优先级高于请求体中的 `thinking` / `reasoning` / `reasoning_effort`。未显式关闭时，各 surface 会按解析后的 DeepSeek 模型默认能力开启 thinking，并用各自协议的原生形态暴露：OpenAI Chat 为 `reasoning_content`，OpenAI Responses 为 `response.reasoning.delta` / `reasoning` content，Claude 为 `thinking` block / `thinking_delta`，Gemini 为 `thought: true` part。
 - 对 OpenAI Chat / Responses 的非流式收尾，如果最终可见正文为空，兼容层会优先尝试把思维链中的独立 DSML / XML 工具块当作真实工具调用解析出来。流式链路也会在收尾阶段做同样的 fallback 检测，但不会因为思维链内容去中途拦截或改写流式输出；真正的工具识别始终基于原始上游文本，而不是基于“已经做过可见输出清洗”的版本，因此即使最终可见层会剥离完整 leaked DSML / XML `tool_calls` wrapper、并抑制全空参数或无效 wrapper 块，也不会影响真实工具调用转成结构化 `tool_calls` / `function_call`。补发结果会作为本轮 assistant 的结构化 `tool_calls` / `function_call` 输出返回，而不是塞进 `content` 文本；如果客户端没有开启 thinking / reasoning，思维链只用于检测，不会作为 `reasoning_content` 或可见正文暴露。只有正文为空且思维链里也没有可执行工具调用时，才继续按空回复错误处理。
 - OpenAI Chat / Responses 的空回复错误处理之前会默认做一次内部补偿重试：第一次上游完整结束后，如果最终可见正文为空、没有解析到工具调用、也没有已经向客户端流式发出工具调用，并且终止原因不是 `content_filter`，兼容层会复用同一个 `chat_session_id`、账号、token 与工具策略，把原始 completion `prompt` 追加固定后缀 `Previous reply had no visible output. Please regenerate the visible final answer or tool call now.` 后重新提交一次。重试遵循 DeepSeek 多轮对话协议：从第一次上游 SSE 流中提取 `response_message_id`，并在重试 payload 中设置 `parent_message_id` 为该值，使重试成为同一会话的后续轮次而非断裂的根消息；同时重新获取一次 PoW（若 PoW 获取失败则回退到原始 PoW）。该重试不会重新标准化消息、不会新建 session、不会切换账号，也不会向流式客户端插入重试标记；第二次 thinking / reasoning 会按正常增量直接接到第一次之后，并继续使用 overlap trim 去重。若第二次仍为空，终端错误码仍保持现有 `upstream_empty_output`；若任一尝试触发空 `content_filter`，不做补偿重试并保持 `content_filter` 错误。JS Vercel 运行时同样设置 `parent_message_id`，但因无法直接调用 PoW API 而复用原始 PoW。
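 - In the final visible-content rendering stage, non-streaming OpenAI Chat / Responses, Claude Messages, and Gemini generateContent replace the `[citation:N]` / `[reference:N]` markers in DeepSeek search results with the corresponding Markdown links. `citation` markers are resolved as one-based indices; `reference` markers are mapped as zero-based indices only when `[reference:0]` (whitespace after the colon is allowed) appears in the same body text, and this does not affect `citation` markers in that text.
 - Streaming output still hides upstream-internal markers such as `[citation:N]` / `[reference:N]` by default, so that chunked output does not leak reference placeholders whose mapping is not yet complete.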
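The citation / reference index rule from the list above can be sketched with a single regular expression pass. Everything here is illustrative: `linkMarkers` is a hypothetical function, the `[N](url)` link label is an assumed output format, treating `reference` markers as one-based when no `[reference:0]` is present is inferred rather than stated, and leaving out-of-range markers untouched is a choice made for this sketch.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

var (
	markerRe  = regexp.MustCompile(`\[(citation|reference):\s*(\d+)\]`)
	refZeroRe = regexp.MustCompile(`\[reference:\s*0\]`)
)

// linkMarkers replaces markers with Markdown links into urls.
func linkMarkers(text string, urls []string) string {
	// reference markers switch to zero-based only if [reference:0] occurs.
	refZeroBased := refZeroRe.MatchString(text)
	return markerRe.ReplaceAllStringFunc(text, func(m string) string {
		sub := markerRe.FindStringSubmatch(m)
		n, err := strconv.Atoi(sub[2])
		if err != nil {
			return m
		}
		idx := n - 1 // citation markers are always one-based
		if sub[1] == "reference" && refZeroBased {
			idx = n
		}
		if idx < 0 || idx >= len(urls) {
			return m // no matching search result: keep the marker as-is
		}
		return fmt.Sprintf("[%d](%s)", n, urls[idx])
	})
}

func main() {
	urls := []string{"https://a.example", "https://b.example"}
	fmt.Println(linkMarkers("See [reference: 0] and [citation:1].", urls))
}
```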
 ## 5. prompt 是怎么拼出来的
 OpenAI Chat / Responses 在标准化后、current input file 之前，会默认执行 `thinking_injection` 增强。它参考 DeepSeek V4 “把控制指令放在 user 消息末尾更稳定”的用法，在最新 user message 后追加思考增强提示词。当前内置默认提示词以 `Reasoning Effort: Absolute maximum with no shortcuts permitted.` 开头，并继续要求模型充分分解问题、覆盖潜在路径与边界条件、把完整推演过程显式写出。该开关默认启用，可通过 `thinking_injection.enabled=false` 关闭；也可以通过 `thinking_injection.prompt` 自定义提示词，留空时使用内置默认提示词。
@@ -114,6 +126,11 @@ OpenAI Chat / Responses 在标准化后、current input file 之前，会默认
 - 普通请求会直接出现在最终 `prompt` 的最新 user block 末尾。
 - 如果触发 current input file，它会进入完整上下文文件中。
 另外，`MessagesPrepareWithThinking` 还会在最终 prompt 的最前面预置一段固定的 system 级“输出完整性约束（Output integrity guard）”：
 - 如果上游上下文、工具输出或解析后的文本出现乱码、损坏、部分解析、重复或其他畸形片段，不要模仿、不要回显，只输出给用户的正确内容。
 - 这段约束位于普通 system / tool prompt 之前，因此是当前最终 prompt 里的最高优先级前置指令。
 ### 5.1 角色标记
 最终 prompt 使用 DeepSeek 风格角色标记：
@@ -151,12 +168,14 @@ OpenAI Chat / Responses 在标准化后、current input file 之前，会默认
 4. 把这整段内容并入 system prompt。
 工具调用正例现在优先示范官方 DSML 风格：`<|DSML|tool_calls>` → `<|DSML|invoke name="...">` → `<|DSML|parameter name="...">`。
-兼容层仍接受旧式纯 `<tool_calls>` wrapper，但提示词会优先要求模型输出官方 DSML 标签，并强调不能只输出 closing wrapper 而漏掉 opening tag。需要注意：这是“兼容 DSML 外壳，内部仍以 XML 解析语义为准”，不是原生 DSML 全链路实现；DSML 标签会在解析入口归一化回现有 XML 标签后继续走同一套 parser。
+兼容层仍接受旧式纯 `<tool_calls>` wrapper，并会容错若干 DSML 标签变体，包括短横线形式 `<dsml-tool-calls>` / `<dsml-invoke>` / `<dsml-parameter>`；但提示词会优先要求模型输出官方 DSML 标签，并强调不能只输出 closing wrapper 而漏掉 opening tag。需要注意：这是“兼容 DSML 外壳，内部仍以 XML 解析语义为准”，不是原生 DSML 全链路实现；DSML 标签会在解析入口归一化回现有 XML 标签后继续走同一套 parser。
-数组参数使用 `<item>...</item>` 子节点表示；当某个参数体只包含 item 子节点时，Go / Node 解析器会把它还原成数组，避免 `questions` / `options` 这类 schema 中要求 array 的参数被误解析成 `{ "item": ... }` 对象。若模型把完整结构化 XML fragment 误包进 CDATA，兼容层会在保护 `content` / `command` 等原文字段的前提下，尝试把非原文字段中的 CDATA XML fragment 还原成 object / array。不过，如果 CDATA 只是单个平面的 XML/HTML 标签，例如 `<b>urgent</b>` 这种行内标记，兼容层会保留原始字符串，不会强行升成 object / array；只有明显表示结构的 CDATA 片段，例如多兄弟节点、嵌套子节点或 `item` 列表，才会触发结构化恢复。
+数组参数使用 `<item>...</item>` 子节点表示；当某个参数体只包含 item 子节点时，Go / Node 解析器会把它还原成数组，避免 `questions` / `options` 这类 schema 中要求 array 的参数被误解析成 `{ "item": ... }` 对象。除此之外，解析器还会回收一些更松散的列表写法，例如 JSON array 字面量或逗号分隔的 JSON 项序列，只要它们足够明确；但 `<item>` 仍然是首选形态。若模型把完整结构化 XML fragment 误包进 CDATA，兼容层会在保护 `content` / `command` 等原文字段的前提下，尝试把非原文字段中的 CDATA XML fragment 还原成 object / array。不过，如果 CDATA 只是单个平面的 XML/HTML 标签，例如 `<b>urgent</b>` 这种行内标记，兼容层会保留原始字符串，不会强行升成 object / array；只有明显表示结构的 CDATA 片段，例如多兄弟节点、嵌套子节点或 `item` 列表，才会触发结构化恢复。对 `command` / `content` 等长文本参数，CDATA 内部的 Markdown fenced DSML / XML 示例会作为原文保护；示例里的 `]]></parameter>` 或 `</tool_calls>` 不会截断外层工具调用，解析器会继续等待围栏外真正的参数 / wrapper 结束标签。
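 When reading DeepSeek SSE, the Go side no longer depends on the fixed 2 MiB per-line limit of `bufio.Scanner`; when a file-writing tool returns a very long `content` in a single `data:` line, non-streaming collection, streaming parsing, and auto-continue passthrough all keep the complete line before it enters the same tool parsing and serialization flow.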
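 In the final assistant response stage, if a tool parameter is explicitly declared as `string` in the schema, the compatibility layer recursively converts any number / bool / object / array on that path to a string before re-serializing the parsed `tool_calls` / `function_call` into the parameters visible to OpenAI / Responses / Claude; objects and arrays are collapsed into compact JSON strings. This protection applies only to paths that the schema explicitly declares as string, and does not rewrite parameters that are genuinely `number` / `boolean` / `object` / `array`. It covers the case where DeepSeek emits a structured fragment while the client-side tool schema strictly requires a string parameter (for example `content`, `prompt`, `path`, `taskId`).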
 工具 schema 的权威来源始终是**当前请求实际携带的 schema**，而不是同名工具在其他 runtime（Claude Code / OpenCode / Codex 等）里的默认印象。兼容层现在会同时兼容 OpenAI 风格 `function.parameters`、直接工具对象上的 `parameters` / `input_schema`、以及 camelCase 的 `inputSchema` / `schema`，并在最终输出阶段按这份请求内 schema 决定是保留 array/object，还是仅对明确声明为 `string` 的路径做字符串化。该规则同样适用于 Claude 的流式收尾和 Vercel Node 流式 tool-call formatter，避免不同 runtime 因 schema shape 差异而出现同名工具参数类型漂移。
 正例中的工具名只会来自当前请求实际声明的工具；如果当前请求没有足够的已知工具形态，就省略对应的单工具、多工具或嵌套示例，避免把不可用工具名写进 prompt。
 对执行类工具，脚本内容必须进入执行参数本身：`Bash` / `execute_command` 使用 `command`，`exec_command` 使用 `cmd`；不要把脚本示范成 `path` / `content` 文件写入参数。
 如果当前请求声明了 `Read` / `read_file` 这类读取工具，兼容层会额外注入一条 read-tool cache guard：当读取结果只表示“文件未变更 / 已在历史中 / 请引用先前上下文 / 没有正文内容”时，模型必须把它视为内容不可用，不能反复调用同一个无正文读取；应改为请求完整正文读取能力，或向用户说明需要重新提供文件内容。这个约束只缓解客户端缓存返回空内容导致的死循环，DS2API 不会也无法凭空恢复客户端本地文件正文。
 OpenAI 路径实现：
 [internal/promptcompat/tool_prompt.go](../internal/promptcompat/tool_prompt.go)
@@ -233,6 +252,14 @@ OpenAI 文件相关实现：
 - 文件 ID 收集：
  [internal/promptcompat/file_refs.go](../internal/promptcompat/file_refs.go)
 OpenAI 的文件上传现在不再是“只传文件本体”的通用路径，而是会先根据请求里的 `model` 解析出 DeepSeek 的上传类型，并把它透传到上传接口的 `x-model-type`。当前可见的上传类型就是 `default` / `expert` / `vision`，其中 vision 请求上传图片时必须带上 `vision`，否则下游容易退回到仅文本或 OCR 语义。这个模型类型会同时用于：
 - `/v1/files` 这类独立文件上传入口
 - Chat / Responses 的 inline 图片、附件上传
 - current input file 触发时生成的 `DS2API_HISTORY.txt` 上下文文件
 也就是说，文件上传和完成请求的 `model_type` 现在是一致的：完成 payload 里仍然是 `model_type`，上传文件则会在 DeepSeek 上传阶段携带同样的模型类型信息。
 结论：
 - “systemprompt 文字”在 prompt 里
@@ -242,11 +269,11 @@ OpenAI 文件相关实现：
 ## 9. 多轮历史为什么不会一直完整内联在 prompt
-兼容层现在只保留 `current_input_file` 这一种拆分方式；旧的 `history_split` 已废弃，只保留为兼容旧配置的字段，不再参与请求处理。
+兼容层现在只保留 `current_input_file` 这一种拆分方式；旧的 `history_split` 配置字段已移除，读取旧配置时会忽略它且不会再写回。
- `current_input_file` 默认开启；它用于把“完整上下文”合并进隐藏上下文文件。当最新 user turn 的纯文本长度达到 `current_input_file.min_chars`（默认 `0`）时，兼容层会上传一个文件名为 `IGNORE.txt` 的上下文文件，并在 live prompt 中只保留一个中性的 user 消息要求模型直接回答最新请求，不再暴露文件名或要求模型读取本地文件。
+- `current_input_file` 默认开启；它在统一 completion runtime 入口全局生效，用于把“完整上下文”合并进 `DS2API_HISTORY.txt` 上下文文件。当最新 user turn 的纯文本长度达到 `current_input_file.min_chars`（默认 `0`）时，runtime 会上传一个文件名为 `DS2API_HISTORY.txt` 的上下文文件。文件内容会先经过各协议入口的标准化，再序列化成按轮次编号的 `DS2API_HISTORY.txt` 风格 transcript，带有 `# DS2API_HISTORY.txt` 标题和 `=== N. ROLE ===` 分段；live prompt 中则会给出一个 continuation 语气的 user 消息，引导模型从 `DS2API_HISTORY.txt` 的最新状态继续推进，并直接回答最新请求，避免把任务拉回起点。
 - 如果 `current_input_file.enabled=false`，请求会直接透传，不上传任何拆分上下文文件。
- 旧的 `history_split.enabled` / `history_split.trigger_after_turns` 会被读取进配置对象以保持兼容，但不会触发拆分上传，也不会影响 `current_input_file` 的默认开启。
+- 即使触发 `current_input_file` 后 live prompt 被缩短，对客户端回包里的上下文 token 统计，仍会沿用**拆分前的完整 prompt 语义**做计数，而不是按缩短后的占位 prompt 计算；否则会把真实上下文显著算小。
 相关实现：
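@@ -254,19 +281,27 @@ OpenAI file-related implementation:
  [internal/config/store_accessors.go](../internal/config/store_accessors.go)
 - Current input to file:
  [internal/httpapi/openai/history/current_input_file.go](../internal/httpapi/openai/history/current_input_file.go)
- Legacy history-split compatibility shell:
+- Global completion runtime application point:
-  [internal/httpapi/openai/history/history_split.go](../internal/httpapi/openai/history/history_split.go)
+  [internal/completionruntime/nonstream.go](../internal/completionruntime/nonstream.go)
-When current input to file is enabled and triggers, the real name of the uploaded file is `IGNORE.txt` and its content is the full `messages` context; it is still serialized first with OpenAI message normalization and DeepSeek role markers, then wrapped in `IGNORE` file boundaries:
+When current input to file is enabled and triggers, the real name of the uploaded file is `DS2API_HISTORY.txt` and its content is the full `messages` context; it uses the OpenAI-compatible message/transcript serialization rules and DeepSeek role markers, and is then numbered by turn into a `DS2API_HISTORY.txt`-style transcript (file boundary tags are no longer injected):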
 ```text
-[uploaded filename]: IGNORE.txt
+[uploaded filename]: DS2API_HISTORY.txt
-[file content end]
+# DS2API_HISTORY.txt
 Prior conversation history and tool progress.
-<｜begin▁of▁sentence｜><｜System｜>...<｜User｜>...<｜Assistant｜>...<｜Tool｜>...<｜User｜>...
+=== 1. SYSTEM ===
 ...
-[file name]: IGNORE
+=== 2. USER ===
-[file content begin]
+...
 === 3. ASSISTANT ===
 ...
 === 4. TOOL ===
 ...
 ```
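 When enabled, the request's live prompt no longer inlines the full context directly; it keeps a short user-role prompt telling the model to answer the latest request directly based on the context already provided. The `file_id` returned by the upload goes into `ref_file_ids`.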
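The transcript layout described in this hunk can be sketched in Go as follows. This is a minimal illustration, not the code in `current_input_file.go`: the `Message` type and `BuildHistoryTranscript` function are hypothetical names, and the exact blank-line placement between sections is an assumption. Only the `# DS2API_HISTORY.txt` title, the description line, and the `=== N. ROLE ===` headers come from the documentation above.

```go
package main

import (
	"fmt"
	"strings"
)

// Message is a hypothetical, simplified chat message. The real normalized
// message type used by DS2API is not shown in this diff.
type Message struct {
	Role    string
	Content string
}

// BuildHistoryTranscript renders messages as a turn-numbered transcript with
// the "# DS2API_HISTORY.txt" title and "=== N. ROLE ===" section headers.
// The blank line before each section is an assumption made for readability.
func BuildHistoryTranscript(messages []Message) string {
	var b strings.Builder
	b.WriteString("# DS2API_HISTORY.txt\n")
	b.WriteString("Prior conversation history and tool progress.\n")
	for i, m := range messages {
		role := strings.ToUpper(strings.TrimSpace(m.Role))
		fmt.Fprintf(&b, "\n=== %d. %s ===\n%s\n", i+1, role, strings.TrimSpace(m.Content))
	}
	return b.String()
}
```

Calling `BuildHistoryTranscript` with a system message followed by a user message yields sections numbered 1 and 2, in the order the messages were given.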
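@@ -281,7 +316,7 @@ OpenAI file-related implementation:
 - Responses `instructions` is prepended as a system message
 - `tools` are injected into the system prompt
 - `attachments` / `input_file` / inline files go into `ref_file_ids`
- current input file mainly takes effect on this path; the old `history_split` is kept only as a compatibility field
+- current input file takes effect globally at the unified completion runtime entry point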
 ### 10.2 Claude Messages
@@ -318,7 +353,7 @@ OpenAI file-related implementation:
 ```json
 {
-  "prompt": "<｜begin▁of▁sentence｜><｜System｜>original system / developer\n\nYou have access to these tools: ...<｜end▁of▁instructions｜><｜User｜>The current request and prior conversation context have already been provided. Answer the latest user request directly.<｜Assistant｜>",
+  "prompt": "<｜begin▁of▁sentence｜><｜System｜>original system / developer\n\nYou have access to these tools: ...<｜end▁of▁instructions｜><｜User｜>Continue from the latest state in the attached DS2API_HISTORY.txt context. Treat it as the current working state and answer the latest user request directly.<｜Assistant｜>",
  "ref_file_ids": [
    "file-current-input-ignore",
    "file-systemprompt",
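@@ -333,7 +368,7 @@ OpenAI file-related implementation:
 - Most structured semantics are compressed into `prompt`
 - Files stay files
- When needed, the full context is split into a hidden context file
+- When needed, the full context is split into the `DS2API_HISTORY.txt` context file and numbered by turn into a transcript
 ## 12. Scenarios where this document must be updated together with the code
@@ -346,8 +381,8 @@ OpenAI file-related implementation:
 - Changes to how tool results are injected
 - Changes to the tool prompt template or tool_choice constraints
 - Changes to inline file upload / file reference collection rules
- Changes to current input file trigger conditions, upload format, or `IGNORE` wrapping format
+- Changes to current input file trigger conditions, upload format, or `DS2API_HISTORY.txt` transcript structure
- Changes to how the old `history_split` compatibility logic is read, ignored, or degraded
+- Changes to how the old `history_split` field is ignored / cleaned up
 - Changes to completion payload field semantics
 - Changes to how Claude / Gemini reuse this unified semantics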
@@ -359,7 +394,8 @@ OpenAI file-related implementation:
 - `internal/promptcompat/tool_prompt.go`
 - `internal/httpapi/openai/files/file_inline_upload.go`
 - `internal/promptcompat/file_refs.go`
- `internal/httpapi/openai/history/history_split.go`
+- `internal/httpapi/openai/history/current_input_file.go`
 - `internal/completionruntime/nonstream.go`
 - `internal/promptcompat/responses_input_normalize.go`
 - `internal/httpapi/claude/standard_request.go`
 - `internal/httpapi/claude/handler_utils.go`
--- a/docs/toolcall-semantics.md
+++ b/docs/toolcall-semantics.md
@@ -26,7 +26,7 @@
 </tool_calls>
 ```
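-This is not a native end-to-end DSML implementation. DSML serves only as the prompt shell and a parser entry alias; before entering the parser it is normalized to `<tool_calls>` / `<invoke>` / `<parameter>`, and internally the existing XML parsing semantics still apply.
+This is not a native end-to-end DSML implementation. DSML is mainly used to make the model deliberately emit a protocol marker and to isolate ordinary XML semantics; before entering the parser it is normalized by fixed local tag name to `<tool_calls>` / `<invoke>` / `<parameter>`, and internally the existing XML parsing semantics still apply.
 Constraints:
@@ -39,7 +39,8 @@
 Compatibility fixes:
 - If the model omits the opening wrapper but still outputs one or more invokes and ends with the closing wrapper, the Go parsing path restores the missing opening wrapper before parsing.
- If the model drops the `|` separator in a DSML tag and writes a space instead (for example `<|DSML tool_calls>` / `<|DSML invoke>` / `<|DSML parameter>`, or the `<DSML tool_calls>` form with no leading pipe), or glues `DSML` directly to the tool tag name (for example `<DSMLtool_calls>` / `<DSMLinvoke>` / `<DSMLparameter>`), or writes the leading pipe as a full-width vertical bar (for example `<｜DSML|tool_calls>` / `<｜DSML|invoke>` / `<｜DSML|parameter>`), Go / Node normalize it within the set of fixed tool tag names; similar names that are not tool tag names (such as `tool_calls_extra`) are still treated as plain text.
+- The Go / Node parsing layer no longer enumerates every DSML typo. It treats `DSML`, the pipe characters `|` / `｜`, whitespace, and repeated leading `<` in front of the tool tag name as tolerable protocol noise, then matches only the fixed local tag names `tool_calls` / `invoke` / `parameter`. For example, `<DSML|tool_calls>`, `<<|DSML|tool_calls>`, `<|DSML tool_calls>`, `<DSMLtool_calls>`, and `<<DSML|DSML|tool_calls>` are all normalized; similar names that are not fixed tag names (such as `tool_calls_extra`) are still treated as plain text.
 - If the model emits an extra trailing pipe after a fixed tool tag name, for example `<|DSML|tool_calls|` / `<|DSML|invoke|` / `<|DSML|parameter|`, the compatibility layer treats that trailing `|` as an abnormal tag terminator and fills in the missing `>`; if a `>` already follows, it consumes the extra `|` and then normalizes.
 - This is a narrow fix for common model mistakes and does not change the recommended output format; the prompt still requires the model to output the complete DSML shell directly.
 - A bare `<invoke ...>` / `<parameter ...>` is not treated as "supported tool syntax"; only a `tool_calls` wrapper, or a repairable missing opening wrapper, enters the tool-call path.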
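The noise-tolerant matching described in the compatibility fixes above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's parser: `CanonicalLocalName` and `isTagBoundary` are hypothetical names, matching `DSML` case-insensitively is an assumption, and only opening tags are handled (closing tags, hyphenated forms, and attribute parsing are out of scope).

```go
package main

import "strings"

// localTagNames are the only tag names treated as tool syntax.
var localTagNames = []string{"tool_calls", "invoke", "parameter"}

// isTagBoundary reports whether rest begins where a tag name may legally end:
// end of input, '>', '/', whitespace, or a trailing ASCII or full-width pipe.
func isTagBoundary(rest string) bool {
	if rest == "" {
		return true
	}
	switch rest[0] {
	case '>', '/', '|', ' ', '\t', '\r', '\n':
		return true
	}
	return strings.HasPrefix(rest, "｜")
}

// CanonicalLocalName skips protocol noise after the opening '<' (repeated '<',
// pipes, whitespace, and "DSML") and then matches one fixed local tag name.
// It returns the canonical name and true, or "" and false for anything else.
func CanonicalLocalName(tag string) (string, bool) {
	if !strings.HasPrefix(tag, "<") {
		return "", false
	}
	s := tag
	for s != "" {
		switch {
		case s[0] == '<' || s[0] == '|' || s[0] == ' ' || s[0] == '\t' || s[0] == '\r' || s[0] == '\n':
			s = s[1:]
		case strings.HasPrefix(s, "｜"):
			s = s[len("｜"):]
		case len(s) >= 4 && strings.EqualFold(s[:4], "DSML"):
			s = s[4:]
		default:
			for _, name := range localTagNames {
				if strings.HasPrefix(s, name) && isTagBoundary(s[len(name):]) {
					return name, true
				}
			}
			return "", false
		}
	}
	return "", false
}
```

Because the name must be followed by a boundary character, `tool_calls_extra` is rejected even though it shares a prefix with `tool_calls`, which is the behavior the bullet list asks for.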
@@ -53,14 +54,16 @@
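 In the streaming path (Go / Node behave the same):
- The DSML `<|DSML|tool_calls>` wrapper, compatible variants (`<dsml|tool_calls>`, `<｜tool_calls>`, `<|tool_calls>`, `<｜DSML|tool_calls>`), the narrowly tolerated space-separated form (such as `<|DSML tool_calls>`), the glued form (such as `<DSMLtool_calls>`), and the canonical `<tool_calls>` wrapper all enter structured capture
+- The DSML `<|DSML|tool_calls>` wrapper, the hyphenated forms (such as `<dsml-tool-calls>` / `<dsml-invoke>` / `<dsml-parameter>`), DSML noise-tolerant forms based on the fixed local tag names, the trailing-pipe form (such as `<|DSML|tool_calls|`), and the canonical `<tool_calls>` wrapper all enter structured capture
 - If the stream starts directly from an invoke but a closing wrapper follows later, the Go stream sieve also attempts recovery through the missing-opening-wrapper repair path
 - Tool calls that have been recognized successfully do not flow back into plain text
 - Blocks that do not match the new format are not executed and continue to be passed through as plain text, unchanged
 - XML examples inside fenced code blocks (backticks `` ``` `` and tildes `~~~`) are always treated as plain text
 - Nested fences (such as 4 backticks nesting 3 backticks) and fence protection inside CDATA are supported
 - For long-text parameters such as `command` / `content`, if the CDATA body contains Markdown-fenced DSML / XML examples, the parameter text is preserved verbatim even when the example contains fragments that look like outer closing tags, such as `]]></parameter>` / `</tool_calls>`, until the real outer closing tag located outside the fence is reached
 - If the model opens `<![CDATA[` but never closes it, the streaming scan stage still conservatively keeps buffering and does not mistake example XML inside the CDATA for a real tool call; in the final parse / flush recovery stage, a narrow fix is applied to this kind of loose CDATA to preserve, as far as possible, real tool calls that are fully wrapped on the outside
 - When text mentions a tag name (such as `<dsml|tool_calls>`, or `<|DSML|tool_calls>` inside Markdown inline code) and a real tool call follows immediately, the sieve skips the unparseable mention candidate and keeps matching the subsequent real tool block; the mention neither causes the tool call to be lost nor truncates the body text after the mention
 - Go-side SSE reading no longer uses the fixed token limit of `bufio.Scanner`; when a single `data:` line contains a very long write-file parameter, non-stream collection, stream parsing, and auto-continue passthrough should all keep the complete line before handing it to the tool parser
 In addition, if the value of a `<parameter>` is itself a valid JSON literal, it is parsed as a structured value instead of always being kept as a string. For example, `123`, `true`, `null`, `[1,2]`, and `{"a":1}` are restored to the corresponding number / boolean / null / array / object.
 Structured XML parameters are also restored to JSON structures: if the parameter body contains only one or more `<item>...</item>` child nodes, an array is output; item-only fields inside nested objects are handled as arrays in the same way. For example, `<parameter name="questions"><item><question>...</question></item></parameter>` outputs `{"questions":[{"question":"..."}]}` rather than `{"questions":{"item":...}}`.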
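The JSON-literal rule for `<parameter>` values can be illustrated with a small Go sketch. `ParseParameterValue` is a hypothetical helper, not the project's parser; leaving quoted JSON strings untouched is an assumption, and the `<item>` array handling and schema-based coercion mentioned elsewhere in this PR are not covered.

```go
package main

import (
	"encoding/json"
	"strings"
)

// ParseParameterValue returns the decoded JSON value when raw is a valid JSON
// number, boolean, null, array, or object, and the original string otherwise.
// Numbers decode to float64, as encoding/json does for interface targets.
func ParseParameterValue(raw string) any {
	trimmed := strings.TrimSpace(raw)
	// Assumption: empty values and quoted JSON strings stay plain strings.
	if trimmed == "" || strings.HasPrefix(trimmed, `"`) {
		return raw
	}
	var decoded any
	if err := json.Unmarshal([]byte(trimmed), &decoded); err != nil {
		// Not a JSON literal (for example free text): keep the raw string.
		return raw
	}
	return decoded
}
```

With this sketch, `[1,2]` becomes a `[]any` of length two, while ordinary text such as `hello world` is returned unchanged because it fails to decode.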
@@ -94,7 +97,7 @@ node --test tests/node/stream-tool-sieve.test.js
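 - The DSML `<|DSML|tool_calls>` wrapper parses correctly
 - The legacy canonical `<tool_calls>` wrapper parses correctly
- Alias variants (`<dsml|tool_calls>`, `<｜tool_calls>`, `<|tool_calls>`), DSML space-separated typos (such as `<|DSML tool_calls>`), and glued typos (such as `<DSMLtool_calls>`) parse correctly
+- DSML noise-tolerant forms of the fixed local tag names (such as `<DSML|tool_calls>`, `<<|DSML|tool_calls>`, `<|DSML tool_calls>`, `<DSMLtool_calls>`, `<<DSML|DSML|tool_calls>`) parse correctly
 - Mixed tags (DSML wrapper + canonical inner) parse correctly after normalization
 - Examples inside tilde fences `~~~` are not executed
 - Examples inside nested fences (4 backticks nesting 3 backticks) are not executed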
--- a/go.mod
+++ b/go.mod
@@ -6,10 +6,13 @@ require (
 	github.com/andybalholm/brotli v1.2.1
 	github.com/go-chi/chi/v5 v5.2.5
 	github.com/google/uuid v1.6.0
 	github.com/hupe1980/go-tiktoken v0.0.10
 	github.com/refraction-networking/utls v1.8.2
 	github.com/router-for-me/CLIProxyAPI/v6 v6.9.14
 )
 require github.com/dlclark/regexp2 v1.11.5 // indirect
 require (
 	github.com/klauspost/compress v1.18.5 // indirect
 	github.com/sirupsen/logrus v1.9.4 // indirect
--- a/go.sum
+++ b/go.sum
@@ -2,10 +2,14 @@ github.com/andybalholm/brotli v1.2.1 h1:R+f5xP285VArJDRgowrfb9DqL18yVK0gKAW/F+eT
 github.com/andybalholm/brotli v1.2.1/go.mod h1:rzTDkvFWvIrjDXZHkuS16NPggd91W3kUSvPlQ1pLaKY=
 github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
 github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
 github.com/dlclark/regexp2 v1.11.5 h1:Q/sSnsKerHeCkc/jSTNq1oCm7KiVgUMZRDUoRu0JQZQ=
 github.com/dlclark/regexp2 v1.11.5/go.mod h1:DHkYz0B9wPfa6wondMfaivmHpzrQ3v9q8cnmRbL6yW8=
 github.com/go-chi/chi/v5 v5.2.5 h1:Eg4myHZBjyvJmAFjFvWgrqDTXFyOzjj7YIm3L3mu6Ug=
 github.com/go-chi/chi/v5 v5.2.5/go.mod h1:X7Gx4mteadT3eDOMTsXzmI4/rwUpOwBHLpAfupzFJP0=
 github.com/google/uuid v1.6.0 h1:NIvaJDMOsjHA8n1jAhLSgzrAzy1Hgr+hNrb57e+94F0=
 github.com/google/uuid v1.6.0/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo=
 github.com/hupe1980/go-tiktoken v0.0.10 h1:m6phOJaGyctqWdGIgwn9X8AfJvaG74tnQoDL+ntOUEQ=
 github.com/hupe1980/go-tiktoken v0.0.10/go.mod h1:NME6d8hrE+Jo+kLUZHhXShYV8e40hYkm4BbSLQKtvAo=
 github.com/klauspost/compress v1.18.5 h1:/h1gH5Ce+VWNLSWqPzOVn6XBO+vJbCNGvjoaGBFW2IE=
 github.com/klauspost/compress v1.18.5/go.mod h1:cwPg85FWrGar70rWktvGQj8/hthj3wpl0PGDogxkrSQ=
 github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
@@ -37,6 +41,8 @@ golang.org/x/net v0.52.0 h1:He/TN1l0e4mmR3QqHMT2Xab3Aj3L9qjbhRm78/6jrW0=
 golang.org/x/net v0.52.0/go.mod h1:R1MAz7uMZxVMualyPXb+VaqGSa3LIaUqk0eEt3w36Sw=
 golang.org/x/sys v0.42.0 h1:omrd2nAlyT5ESRdCLYdm3+fMfNFE/+Rf4bDIQImRJeo=
 golang.org/x/sys v0.42.0/go.mod h1:4GL1E5IUh+htKOUEOaiffhrAeqysfVGipDYzABqnCmw=
 golang.org/x/text v0.35.0 h1:JOVx6vVDFokkpaq1AEptVzLTpDe9KGpj5tR4/X+ybL8=
 golang.org/x/text v0.35.0/go.mod h1:khi/HExzZJ2pGnjenulevKNX1W67CUy0AsXcNubPGCA=
 gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405 h1:yhCVgyC4o1eVCa2tZl7eS0r+SDo693bJlVdllGtEeKM=
 gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
 gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
--- a/internal/assistantturn/stream.go
+++ b/internal/assistantturn/stream.go
@@ -0,0 +1,64 @@
 package assistantturn
 import (
 	"ds2api/internal/httpapi/openai/shared"
 	"ds2api/internal/sse"
 )
 type StreamEventType string
 const (
 	StreamEventTextDelta     StreamEventType = "text_delta"
 	StreamEventThinkingDelta StreamEventType = "thinking_delta"
 	StreamEventToolCall      StreamEventType = "tool_call"
 	StreamEventDone          StreamEventType = "done"
 	StreamEventError         StreamEventType = "error"
 	StreamEventPing          StreamEventType = "ping"
 )
 type StreamEvent struct {
 	Type     StreamEventType
 	Text     string
 	Thinking string
 	ToolCall any
 	Error    *OutputError
 	Usage    *Usage
 }
 type Accumulator struct {
 	inner shared.StreamAccumulator
 }
 type AccumulatorOptions struct {
 	ThinkingEnabled       bool
 	SearchEnabled         bool
 	StripReferenceMarkers bool
 }
 func NewAccumulator(opts AccumulatorOptions) *Accumulator {
 	return &Accumulator{
 		inner: shared.StreamAccumulator{
 			ThinkingEnabled:       opts.ThinkingEnabled,
 			SearchEnabled:         opts.SearchEnabled,
 			StripReferenceMarkers: opts.StripReferenceMarkers,
 		},
 	}
 }
 func (a *Accumulator) Apply(parsed sse.LineResult) shared.StreamAccumulatorResult {
 	if a == nil {
 		return shared.StreamAccumulatorResult{}
 	}
 	return a.inner.Apply(parsed)
 }
 func (a *Accumulator) Snapshot() (rawText, text, rawThinking, thinking, detectionThinking string) {
 	if a == nil {
 		return "", "", "", "", ""
 	}
 	return a.inner.RawText.String(),
 		a.inner.Text.String(),
 		a.inner.RawThinking.String(),
 		a.inner.Thinking.String(),
 		a.inner.ToolDetectionThinking.String()
 }
--- a/internal/assistantturn/turn.go
+++ b/internal/assistantturn/turn.go
@@ -0,0 +1,285 @@
 package assistantturn
 import (
 	"net/http"
 	"strings"
 	"ds2api/internal/httpapi/openai/shared"
 	"ds2api/internal/promptcompat"
 	"ds2api/internal/sse"
 	"ds2api/internal/toolcall"
 	"ds2api/internal/util"
 )
 type StopReason string
 const (
 	StopReasonStop          StopReason = "stop"
 	StopReasonToolCalls     StopReason = "tool_calls"
 	StopReasonContentFilter StopReason = "content_filter"
 	StopReasonError         StopReason = "error"
 )
 type Usage struct {
 	InputTokens     int
 	OutputTokens    int
 	ReasoningTokens int
 	TotalTokens     int
 }
 type OutputError struct {
 	Status  int
 	Message string
 	Code    string
 }
 type Turn struct {
 	Model             string
 	Prompt            string
 	RawText           string
 	RawThinking       string
 	DetectionThinking string
 	Text              string
 	Thinking          string
 	ToolCalls         []toolcall.ParsedToolCall
 	ParsedToolCalls   toolcall.ToolCallParseResult
 	CitationLinks     map[int]string
 	ContentFilter     bool
 	ResponseMessageID int
 	StopReason        StopReason
 	Usage             Usage
 	Error             *OutputError
 }
 type FinalizeOptions struct {
 	AlreadyEmittedToolCalls bool
 }
 type FinalOutcome struct {
 	FinishReason     string
 	Error            *OutputError
 	Usage            Usage
 	HasToolCalls     bool
 	HasVisibleText   bool
 	HasVisibleOutput bool
 	ShouldFail       bool
 }
 type BuildOptions struct {
 	Model                 string
 	Prompt                string
 	RefFileTokens         int
 	SearchEnabled         bool
 	StripReferenceMarkers bool
 	ToolNames             []string
 	ToolsRaw              any
 	ToolChoice            promptcompat.ToolChoicePolicy
 }
 type StreamSnapshot struct {
 	RawText               string
 	VisibleText           string
 	RawThinking           string
 	VisibleThinking       string
 	DetectionThinking     string
 	ContentFilter         bool
 	CitationLinks         map[int]string
 	ResponseMessageID     int
 	AlreadyEmittedCalls   bool
 	AdditionalToolCalls   []toolcall.ParsedToolCall
 	AlreadyEmittedToolRaw bool
 }
 func BuildTurnFromCollected(result sse.CollectResult, opts BuildOptions) Turn {
 	thinking := shared.CleanVisibleOutput(result.Thinking, opts.StripReferenceMarkers)
 	text := shared.CleanVisibleOutput(result.Text, opts.StripReferenceMarkers)
 	if opts.SearchEnabled {
 		text = shared.ReplaceCitationMarkersWithLinks(text, result.CitationLinks)
 	}
 	parsed := shared.DetectAssistantToolCalls(result.Text, text, result.Thinking, result.ToolDetectionThinking, opts.ToolNames)
 	calls := toolcall.NormalizeParsedToolCallsForSchemas(parsed.Calls, opts.ToolsRaw)
 	parsed.Calls = calls
 	stopReason := StopReasonStop
 	if result.ContentFilter {
 		stopReason = StopReasonContentFilter
 	}
 	if len(calls) > 0 {
 		stopReason = StopReasonToolCalls
 	}
 	turn := Turn{
 		Model:             opts.Model,
 		Prompt:            opts.Prompt,
 		RawText:           result.Text,
 		RawThinking:       result.Thinking,
 		DetectionThinking: result.ToolDetectionThinking,
 		Text:              text,
 		Thinking:          thinking,
 		ToolCalls:         calls,
 		ParsedToolCalls:   parsed,
 		CitationLinks:     result.CitationLinks,
 		ContentFilter:     result.ContentFilter,
 		ResponseMessageID: result.ResponseMessageID,
 		StopReason:        stopReason,
 	}
 	turn.Usage = BuildUsage(opts.Model, opts.Prompt, thinking, text, opts.RefFileTokens)
 	turn.Error = ValidateTurn(turn, opts.ToolChoice)
 	if turn.Error != nil {
 		turn.StopReason = StopReasonError
 	}
 	return turn
 }
 func BuildTurnFromStreamSnapshot(snapshot StreamSnapshot, opts BuildOptions) Turn {
 	thinking := shared.CleanVisibleOutput(snapshot.VisibleThinking, opts.StripReferenceMarkers)
 	text := shared.CleanVisibleOutput(snapshot.VisibleText, opts.StripReferenceMarkers)
 	if opts.SearchEnabled {
 		text = shared.ReplaceCitationMarkersWithLinks(text, snapshot.CitationLinks)
 	}
 	parsed := shared.DetectAssistantToolCalls(snapshot.RawText, text, snapshot.RawThinking, snapshot.DetectionThinking, opts.ToolNames)
 	calls := parsed.Calls
 	if len(calls) == 0 && len(snapshot.AdditionalToolCalls) > 0 {
 		calls = snapshot.AdditionalToolCalls
 	}
 	calls = toolcall.NormalizeParsedToolCallsForSchemas(calls, opts.ToolsRaw)
 	parsed.Calls = calls
 	stopReason := StopReasonStop
 	if snapshot.ContentFilter {
 		stopReason = StopReasonContentFilter
 	}
 	if len(calls) > 0 || snapshot.AlreadyEmittedCalls || snapshot.AlreadyEmittedToolRaw {
 		stopReason = StopReasonToolCalls
 	}
 	turn := Turn{
 		Model:             opts.Model,
 		Prompt:            opts.Prompt,
 		RawText:           snapshot.RawText,
 		RawThinking:       snapshot.RawThinking,
 		DetectionThinking: snapshot.DetectionThinking,
 		Text:              text,
 		Thinking:          thinking,
 		ToolCalls:         calls,
 		ParsedToolCalls:   parsed,
 		CitationLinks:     snapshot.CitationLinks,
 		ContentFilter:     snapshot.ContentFilter,
 		ResponseMessageID: snapshot.ResponseMessageID,
 		StopReason:        stopReason,
 	}
 	turn.Usage = BuildUsage(opts.Model, opts.Prompt, thinking, text, opts.RefFileTokens)
 	if !snapshot.AlreadyEmittedCalls && !snapshot.AlreadyEmittedToolRaw {
 		turn.Error = ValidateTurn(turn, opts.ToolChoice)
 	}
 	if turn.Error != nil && len(calls) == 0 {
 		turn.StopReason = StopReasonError
 	}
 	return turn
 }
 func BuildUsage(model, prompt, thinking, text string, refFileTokens int) Usage {
 	inputTokens := util.CountPromptTokens(prompt, model) + refFileTokens
 	reasoningTokens := util.CountOutputTokens(thinking, model)
 	outputTokens := reasoningTokens + util.CountOutputTokens(text, model)
 	return Usage{
 		InputTokens:     inputTokens,
 		OutputTokens:    outputTokens,
 		ReasoningTokens: reasoningTokens,
 		TotalTokens:     inputTokens + outputTokens,
 	}
 }
 func ValidateTurn(turn Turn, policy promptcompat.ToolChoicePolicy) *OutputError {
 	if policy.IsRequired() && len(turn.ToolCalls) == 0 {
 		return &OutputError{
 			Status:  http.StatusUnprocessableEntity,
 			Message: "tool_choice requires at least one valid tool call.",
 			Code:    "tool_choice_violation",
 		}
 	}
 	if len(turn.ToolCalls) > 0 {
 		return nil
 	}
 	if strings.TrimSpace(turn.Text) != "" {
 		return nil
 	}
 	status, message, code := UpstreamEmptyOutputDetail(turn.ContentFilter, turn.Text, turn.Thinking)
 	return &OutputError{Status: status, Message: message, Code: code}
 }
 func UpstreamEmptyOutputDetail(contentFilter bool, text, thinking string) (int, string, string) {
 	_ = text
 	if contentFilter {
 		return http.StatusBadRequest, "Upstream content filtered the response and returned no output.", "content_filter"
 	}
 	if strings.TrimSpace(thinking) != "" {
 		return http.StatusTooManyRequests, "Upstream account hit a rate limit and returned reasoning without visible output.", "upstream_empty_output"
 	}
 	return http.StatusTooManyRequests, "Upstream account hit a rate limit and returned empty output.", "upstream_empty_output"
 }
 // ShouldRetryEmptyOutput returns true when the turn produced no visible text
 // and has no tool calls or content filter. This includes thinking-only responses,
 // where the model returned reasoning but no answer — a retry may yield text.
 func ShouldRetryEmptyOutput(turn Turn, attempts, maxAttempts int) bool {
 	return attempts < maxAttempts &&
 		!turn.ContentFilter &&
 		len(turn.ToolCalls) == 0 &&
 		strings.TrimSpace(turn.Text) == ""
 }
 func FinalizeTurn(turn Turn, opts FinalizeOptions) FinalOutcome {
 	hasToolCalls := len(turn.ToolCalls) > 0 || opts.AlreadyEmittedToolCalls
 	hasVisibleText := strings.TrimSpace(turn.Text) != ""
 	hasVisibleThinking := strings.TrimSpace(turn.Thinking) != ""
 	err := turn.Error
 	if hasToolCalls {
 		err = nil
 	}
 	finishReason := FinishReason(turn)
 	if hasToolCalls {
 		finishReason = "tool_calls"
 	}
 	return FinalOutcome{
 		FinishReason:     finishReason,
 		Error:            err,
 		Usage:            turn.Usage,
 		HasToolCalls:     hasToolCalls,
 		HasVisibleText:   hasVisibleText,
 		HasVisibleOutput: hasVisibleText || hasVisibleThinking || hasToolCalls,
 		ShouldFail:       err != nil,
 	}
 }
 func OpenAIChatUsage(turn Turn) map[string]any {
 	return map[string]any{
 		"prompt_tokens":     turn.Usage.InputTokens,
 		"completion_tokens": turn.Usage.OutputTokens,
 		"total_tokens":      turn.Usage.TotalTokens,
 		"completion_tokens_details": map[string]any{
 			"reasoning_tokens": turn.Usage.ReasoningTokens,
 		},
 	}
 }
 func OpenAIResponsesUsage(turn Turn) map[string]any {
 	return map[string]any{
 		"input_tokens":  turn.Usage.InputTokens,
 		"output_tokens": turn.Usage.OutputTokens,
 		"total_tokens":  turn.Usage.TotalTokens,
 	}
 }
 func FinishReason(turn Turn) string {
 	switch turn.StopReason {
 	case StopReasonToolCalls:
 		return "tool_calls"
 	case StopReasonContentFilter:
 		return "content_filter"
 	default:
 		return "stop"
 	}
 }
--- a/internal/assistantturn/turn_test.go
+++ b/internal/assistantturn/turn_test.go
@@ -0,0 +1,141 @@
 package assistantturn
 import (
 	"testing"
 	"ds2api/internal/promptcompat"
 	"ds2api/internal/sse"
 )
 func TestBuildTurnFromCollectedTextCitation(t *testing.T) {
 	turn := BuildTurnFromCollected(sse.CollectResult{
 		Text:          "See [citation:1]",
 		CitationLinks: map[int]string{1: "https://example.com"},
 	}, BuildOptions{Model: "deepseek-v4-flash", Prompt: "prompt", SearchEnabled: true})
 	if turn.Text != "See [1](https://example.com)" {
 		t.Fatalf("text mismatch: %q", turn.Text)
 	}
 	if turn.StopReason != StopReasonStop {
 		t.Fatalf("stop reason mismatch: %q", turn.StopReason)
 	}
 	if turn.Error != nil {
 		t.Fatalf("unexpected error: %#v", turn.Error)
 	}
 }
 func TestBuildTurnFromCollectedKeepsNonStreamReferenceLinks(t *testing.T) {
 	turn := BuildTurnFromCollected(sse.CollectResult{
 		Text: "结论[reference:0]，补充[reference:1]。",
 		CitationLinks: map[int]string{
 			1: "https://example.com/a",
 			2: "https://example.com/b",
 		},
 	}, BuildOptions{Model: "deepseek-v4-flash-search", Prompt: "prompt", SearchEnabled: true})
 	want := "结论[0](https://example.com/a)，补充[1](https://example.com/b)。"
 	if turn.Text != want {
 		t.Fatalf("text mismatch: got %q want %q", turn.Text, want)
 	}
 }
 func TestBuildTurnFromCollectedToolCall(t *testing.T) {
 	turn := BuildTurnFromCollected(sse.CollectResult{
 		Text: `<tool_calls><invoke name="Write"><parameter name="content">{"x":1}</parameter></invoke></tool_calls>`,
 	}, BuildOptions{
 		ToolNames: []string{"Write"},
 		ToolsRaw: []any{map[string]any{
 			"name": "Write",
 			"input_schema": map[string]any{
 				"type": "object",
 				"properties": map[string]any{
 					"content": map[string]any{"type": "string"},
 				},
 			},
 		}},
 	})
 	if len(turn.ToolCalls) != 1 {
 		t.Fatalf("expected one tool call, got %d", len(turn.ToolCalls))
 	}
 	if turn.StopReason != StopReasonToolCalls {
 		t.Fatalf("stop reason mismatch: %q", turn.StopReason)
 	}
 	if _, ok := turn.ToolCalls[0].Input["content"].(string); !ok {
 		t.Fatalf("expected content coerced to string, got %#v", turn.ToolCalls[0].Input["content"])
 	}
 }
 func TestBuildTurnFromCollectedThinkingOnlyIsEmptyOutput(t *testing.T) {
 	turn := BuildTurnFromCollected(sse.CollectResult{Thinking: "hidden"}, BuildOptions{})
 	if turn.Error == nil || turn.Error.Code != "upstream_empty_output" {
 		t.Fatalf("expected empty output error, got %#v", turn.Error)
 	}
 }
 func TestBuildTurnFromCollectedToolChoiceRequired(t *testing.T) {
 	turn := BuildTurnFromCollected(sse.CollectResult{Text: "hello"}, BuildOptions{
 		ToolChoice: promptcompat.ToolChoicePolicy{Mode: promptcompat.ToolChoiceRequired},
 	})
 	if turn.Error == nil || turn.Error.Code != "tool_choice_violation" {
 		t.Fatalf("expected tool choice violation, got %#v", turn.Error)
 	}
 }
 func TestBuildTurnFromStreamSnapshotUsesVisibleTextAndRawToolDetection(t *testing.T) {
 	turn := BuildTurnFromStreamSnapshot(StreamSnapshot{
 		RawText:     `<tool_calls><invoke name="Write"><parameter name="content">{"x":1}</parameter></invoke></tool_calls>`,
 		VisibleText: "",
 	}, BuildOptions{
 		ToolNames: []string{"Write"},
 		ToolsRaw: []any{map[string]any{
 			"name": "Write",
 			"schema": map[string]any{
 				"type": "object",
 				"properties": map[string]any{
 					"content": map[string]any{"type": "string"},
 				},
 			},
 		}},
 	})
 	if len(turn.ToolCalls) != 1 {
 		t.Fatalf("expected stream snapshot tool call, got %d", len(turn.ToolCalls))
 	}
 	if _, ok := turn.ToolCalls[0].Input["content"].(string); !ok {
 		t.Fatalf("expected stream snapshot schema coercion, got %#v", turn.ToolCalls[0].Input["content"])
 	}
 }
 func TestBuildTurnFromStreamSnapshotAlreadyEmittedToolAvoidsEmptyError(t *testing.T) {
 	turn := BuildTurnFromStreamSnapshot(StreamSnapshot{AlreadyEmittedCalls: true}, BuildOptions{})
 	if turn.Error != nil {
 		t.Fatalf("unexpected empty-output error after emitted tool call: %#v", turn.Error)
 	}
 	if turn.StopReason != StopReasonToolCalls {
 		t.Fatalf("stop reason mismatch: %q", turn.StopReason)
 	}
 }
 func TestFinalizeTurnStopOutcome(t *testing.T) {
 	turn := BuildTurnFromCollected(sse.CollectResult{Text: "hello"}, BuildOptions{})
 	outcome := FinalizeTurn(turn, FinalizeOptions{})
 	if outcome.ShouldFail {
 		t.Fatalf("unexpected failure: %#v", outcome.Error)
 	}
 	if outcome.FinishReason != "stop" || !outcome.HasVisibleText || !outcome.HasVisibleOutput {
 		t.Fatalf("unexpected outcome: %#v", outcome)
 	}
 }
 func TestFinalizeTurnToolCallsOutcome(t *testing.T) {
 	turn := BuildTurnFromStreamSnapshot(StreamSnapshot{AlreadyEmittedCalls: true}, BuildOptions{})
 	outcome := FinalizeTurn(turn, FinalizeOptions{AlreadyEmittedToolCalls: true})
 	if outcome.ShouldFail || outcome.FinishReason != "tool_calls" || !outcome.HasToolCalls {
 		t.Fatalf("unexpected tool outcome: %#v", outcome)
 	}
 }
 func TestFinalizeTurnContentFilterOutcome(t *testing.T) {
 	turn := BuildTurnFromCollected(sse.CollectResult{ContentFilter: true}, BuildOptions{})
 	outcome := FinalizeTurn(turn, FinalizeOptions{})
 	if !outcome.ShouldFail || outcome.Error == nil || outcome.Error.Code != "content_filter" {
 		t.Fatalf("expected content filter failure, got %#v", outcome)
 	}
 }
--- a/internal/chathistory/store.go
+++ b/internal/chathistory/store.go
@@ -14,6 +14,7 @@ import (
 	"github.com/google/uuid"
 	"ds2api/internal/config"
 	"ds2api/internal/util"
 )
 const (
@@ -42,6 +43,7 @@ type Entry struct {
 	Status           string         `json:"status"`
 	CallerID         string         `json:"caller_id,omitempty"`
 	AccountID        string         `json:"account_id,omitempty"`
 	Surface          string         `json:"surface,omitempty"`
 	Model            string         `json:"model,omitempty"`
 	Stream           bool           `json:"stream"`
 	UserInput        string         `json:"user_input,omitempty"`
@@ -71,6 +73,7 @@ type SummaryEntry struct {
 	Status         string `json:"status"`
 	CallerID       string `json:"caller_id,omitempty"`
 	AccountID      string `json:"account_id,omitempty"`
 	Surface        string `json:"surface,omitempty"`
 	Model          string `json:"model,omitempty"`
 	Stream         bool   `json:"stream"`
 	UserInput      string `json:"user_input,omitempty"`
@@ -91,6 +94,7 @@ type File struct {
 type StartParams struct {
 	CallerID    string
 	AccountID   string
 	Surface     string
 	Model       string
 	Stream      bool
 	UserInput   string
@@ -270,6 +274,7 @@ func (s *Store) Start(params StartParams) (Entry, error) {
 		Status:      "streaming",
 		CallerID:    strings.TrimSpace(params.CallerID),
 		AccountID:   strings.TrimSpace(params.AccountID),
 		Surface:     strings.TrimSpace(params.Surface),
 		Model:       strings.TrimSpace(params.Model),
 		Stream:      params.Stream,
 		UserInput:   strings.TrimSpace(params.UserInput),
@@ -309,8 +314,12 @@ func (s *Store) Update(id string, params UpdateParams) (Entry, error) {
 	if params.Status != "" {
 		item.Status = params.Status
 	}
-	item.ReasoningContent = params.ReasoningContent
-	item.Content = params.Content
+	if params.ReasoningContent != "" || item.ReasoningContent == "" {
+		item.ReasoningContent = params.ReasoningContent
+	}
+	if params.Content != "" || item.Content == "" {
+		item.Content = params.Content
+	}
 	item.Error = strings.TrimSpace(params.Error)
 	item.StatusCode = params.StatusCode
 	item.ElapsedMs = params.ElapsedMs
@@ -541,10 +550,13 @@ func (s *Store) rebuildIndexLocked() {
 		summaries = append(summaries, summaryFromEntry(item))
 	}
 	sort.Slice(summaries, func(i, j int) bool {
-		if summaries[i].UpdatedAt == summaries[j].UpdatedAt {
-			return summaries[i].CreatedAt > summaries[j].CreatedAt
+		if summaries[i].CreatedAt == summaries[j].CreatedAt {
+			if summaries[i].Revision == summaries[j].Revision {
+				return summaries[i].UpdatedAt > summaries[j].UpdatedAt
+			}
+			return summaries[i].Revision > summaries[j].Revision
 		}
-		return summaries[i].UpdatedAt > summaries[j].UpdatedAt
+		return summaries[i].CreatedAt > summaries[j].CreatedAt
 	})
 	if s.state.Limit < DisabledLimit || !isAllowedLimit(s.state.Limit) {
 		s.state.Limit = DefaultLimit
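The new ordering in `rebuildIndexLocked` (creation time first, then revision, then update time, all descending) can be shown in isolation. The `Summary` struct and `SortSummaries` function below are hypothetical stand-ins, and the integer field types are an assumption, since the real `SummaryEntry` field types are not visible in this hunk.

```go
package main

import "sort"

// Summary is a hypothetical reduction of SummaryEntry to the sort keys.
// The int64 field types are assumed for illustration.
type Summary struct {
	ID        string
	CreatedAt int64
	UpdatedAt int64
	Revision  int64
}

// SortSummaries orders newest-created first. Ties on CreatedAt fall back to
// the higher Revision, and then to the more recent UpdatedAt.
func SortSummaries(items []Summary) {
	sort.SliceStable(items, func(i, j int) bool {
		a, b := items[i], items[j]
		if a.CreatedAt != b.CreatedAt {
			return a.CreatedAt > b.CreatedAt
		}
		if a.Revision != b.Revision {
			return a.Revision > b.Revision
		}
		return a.UpdatedAt > b.UpdatedAt
	})
}
```

Keying on `CreatedAt` rather than `UpdatedAt` means a long-running streaming entry that keeps receiving updates does not jump ahead of entries created after it, which appears to be what `TestStoreOrdersByCreationTimeNotStreamingUpdates` checks.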
@@ -588,6 +600,7 @@ func summaryFromEntry(item Entry) SummaryEntry {
 		Status:         item.Status,
 		CallerID:       item.CallerID,
 		AccountID:      item.AccountID,
 		Surface:        item.Surface,
 		Model:          item.Model,
 		Stream:         item.Stream,
 		UserInput:      item.UserInput,
@@ -610,8 +623,8 @@ func buildPreview(item Entry) string {
 	if candidate == "" {
 		candidate = strings.TrimSpace(item.UserInput)
 	}
-	if len(candidate) > defaultPreviewAt {
+	if truncated, ok := util.TruncateRunes(candidate, defaultPreviewAt); ok {
-		return candidate[:defaultPreviewAt] + "..."
+		return truncated + "..."
 	}
 	return candidate
 }
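The `buildPreview` change replaces byte slicing with rune-aware truncation, because cutting `candidate[:defaultPreviewAt]` can split a multi-byte character and produce invalid UTF-8. The body of `util.TruncateRunes` is not part of this excerpt; the following is a hypothetical sketch that matches the `(string, bool)` shape seen at the call site.

```go
package main

import "unicode/utf8"

// TruncateRunes returns the first max runes of s and true when s is longer
// than max runes; otherwise it returns s unchanged and false. This is an
// illustrative re-implementation, not the project's util.TruncateRunes.
func TruncateRunes(s string, max int) (string, bool) {
	if max < 0 {
		max = 0
	}
	if utf8.RuneCountInString(s) <= max {
		return s, false
	}
	count := 0
	// Ranging over a string yields the byte offset of each rune start,
	// so s[:i] always ends on a rune boundary.
	for i := range s {
		if count == max {
			return s[:i], true
		}
		count++
	}
	return s, false
}
```

The boolean lets the caller append `"..."` only when something was actually cut, which mirrors how the new `buildPreview` uses the result.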
--- a/internal/chathistory/store_test.go
+++ b/internal/chathistory/store_test.go
@@ -8,6 +8,8 @@ import (
 	"strings"
 	"sync"
 	"testing"
 	"time"
 	"unicode/utf8"
 )
 func blockDetailDir(t *testing.T, detailDir string) func() {
@@ -105,6 +107,17 @@ func TestStoreCreatesAndPersistsEntries(t *testing.T) {
 	}
 }
 func TestBuildPreviewPreservesUTF8MB4Characters(t *testing.T) {
 	long := strings.Repeat("😀", defaultPreviewAt+1)
 	preview := buildPreview(Entry{Content: long})
 	if !utf8.ValidString(preview) {
 		t.Fatalf("expected valid utf-8 preview, got %q", preview)
 	}
 	if preview != strings.Repeat("😀", defaultPreviewAt)+"..." {
 		t.Fatalf("unexpected preview: %q", preview)
 	}
 }
 func TestStoreTrimsToConfiguredLimit(t *testing.T) {
 	path := filepath.Join(t.TempDir(), "chat_history.json")
 	store := New(path)
@@ -481,3 +494,142 @@ func TestStoreWritesOnlyChangedDetailFiles(t *testing.T) {
 		t.Fatalf("expected untouched detail file to remain byte-identical")
 	}
 }
 func TestStoreOrdersByCreationTimeNotStreamingUpdates(t *testing.T) {
 	path := filepath.Join(t.TempDir(), "chat_history.json")
 	store := New(path)
 	first, err := store.Start(StartParams{UserInput: "first"})
 	if err != nil {
 		t.Fatalf("start first failed: %v", err)
 	}
 	time.Sleep(time.Millisecond)
 	second, err := store.Start(StartParams{UserInput: "second"})
 	if err != nil {
 		t.Fatalf("start second failed: %v", err)
 	}
 	time.Sleep(time.Millisecond)
 	if _, err := store.Update(first.ID, UpdateParams{Status: "streaming", Content: "still running"}); err != nil {
 		t.Fatalf("update first failed: %v", err)
 	}
 	snapshot, err := store.Snapshot()
 	if err != nil {
 		t.Fatalf("snapshot failed: %v", err)
 	}
 	if len(snapshot.Items) != 2 {
 		t.Fatalf("expected two items, got %#v", snapshot.Items)
 	}
 	if snapshot.Items[0].ID != second.ID || snapshot.Items[1].ID != first.ID {
 		t.Fatalf("expected creation-time order to stay stable, got %#v", snapshot.Items)
 	}
 }
 func TestUpdatePreservesContentWhenNewContentIsEmpty(t *testing.T) {
 	path := filepath.Join(t.TempDir(), "chat_history.json")
 	store := New(path)
 	started, err := store.Start(StartParams{
 		CallerID:  "caller:abc",
 		Model:     "deepseek-v4-flash",
 		Stream:    true,
 		UserInput: "hello",
 	})
 	if err != nil {
 		t.Fatalf("start entry failed: %v", err)
 	}
 	if _, err := store.Update(started.ID, UpdateParams{
 		Status:           "streaming",
 		ReasoningContent: "let me think",
 		Content:          "I'll help you with that.",
 	}); err != nil {
 		t.Fatalf("progress update failed: %v", err)
 	}
 	updated, err := store.Update(started.ID, UpdateParams{
 		Status:    "success",
 		Content:   "",
 		Completed: true,
 	})
 	if err != nil {
 		t.Fatalf("success update failed: %v", err)
 	}
 	if updated.Content != "I'll help you with that." {
 		t.Fatalf("expected content to be preserved, got %q", updated.Content)
 	}
 	if updated.ReasoningContent != "let me think" {
 		t.Fatalf("expected reasoning content to be preserved, got %q", updated.ReasoningContent)
 	}
 	full, err := store.Get(started.ID)
 	if err != nil {
 		t.Fatalf("get entry failed: %v", err)
 	}
 	if full.Content != "I'll help you with that." {
 		t.Fatalf("expected persisted content to be preserved, got %q", full.Content)
 	}
 	if full.ReasoningContent != "let me think" {
 		t.Fatalf("expected persisted reasoning content to be preserved, got %q", full.ReasoningContent)
 	}
 }
 func TestUpdateAllowsSettingContentFromEmpty(t *testing.T) {
 	path := filepath.Join(t.TempDir(), "chat_history.json")
 	store := New(path)
 	started, err := store.Start(StartParams{
 		CallerID:  "caller:abc",
 		Model:     "deepseek-v4-flash",
 		Stream:    true,
 		UserInput: "hello",
 	})
 	if err != nil {
 		t.Fatalf("start entry failed: %v", err)
 	}
 	updated, err := store.Update(started.ID, UpdateParams{
 		Status:  "success",
 		Content: "final answer",
 	})
 	if err != nil {
 		t.Fatalf("update failed: %v", err)
 	}
 	if updated.Content != "final answer" {
 		t.Fatalf("expected content to be set, got %q", updated.Content)
 	}
 }
 func TestUpdateAllowsOverwritingContentWithNewValue(t *testing.T) {
 	path := filepath.Join(t.TempDir(), "chat_history.json")
 	store := New(path)
 	started, err := store.Start(StartParams{
 		CallerID:  "caller:abc",
 		Model:     "deepseek-v4-flash",
 		Stream:    true,
 		UserInput: "hello",
 	})
 	if err != nil {
 		t.Fatalf("start entry failed: %v", err)
 	}
 	if _, err := store.Update(started.ID, UpdateParams{
 		Status:  "streaming",
 		Content: "partial",
 	}); err != nil {
 		t.Fatalf("first update failed: %v", err)
 	}
 	updated, err := store.Update(started.ID, UpdateParams{
 		Status:  "success",
 		Content: "final answer",
 	})
 	if err != nil {
 		t.Fatalf("second update failed: %v", err)
 	}
 	if updated.Content != "final answer" {
 		t.Fatalf("expected content to be overwritten, got %q", updated.Content)
 	}
 }
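The three `Update` tests above pin down one rule: a non-empty field in the update overwrites the stored value, and an empty field leaves it untouched. A minimal sketch of that rule follows. The `entry`, `updateParams`, and `applyUpdate` names are hypothetical stand-ins, not the store's actual types, and the real `Store.Update` may apply further logic (persistence, timestamps, `Completed` handling).

```go
package main

// entry and updateParams are hypothetical, trimmed-down stand-ins for the
// store's record and update types; only the fields the tests touch are kept.
type entry struct {
	Status           string
	Content          string
	ReasoningContent string
}

type updateParams struct {
	Status           string
	Content          string
	ReasoningContent string
}

// applyUpdate returns a copy of e with p applied. Status is always taken
// from the update; Content and ReasoningContent are overwritten only when
// the update carries a non-empty value, so a final "success" update with
// empty content keeps what was streamed earlier.
func applyUpdate(e entry, p updateParams) entry {
	e.Status = p.Status
	if p.Content != "" {
		e.Content = p.Content
	}
	if p.ReasoningContent != "" {
		e.ReasoningContent = p.ReasoningContent
	}
	return e
}
```

Under this rule, a caller cannot clear content by sending an empty string; that trade-off is what `TestUpdatePreservesContentWhenNewContentIsEmpty` asserts.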
--- a/internal/completionruntime/nonstream.go
+++ b/internal/completionruntime/nonstream.go
@@ -0,0 +1,193 @@
 package completionruntime
 import (
 	"context"
 	"fmt"
 	"io"
 	"net/http"
 	"strings"
 	"ds2api/internal/assistantturn"
 	"ds2api/internal/auth"
 	"ds2api/internal/config"
 	dsclient "ds2api/internal/deepseek/client"
 	"ds2api/internal/httpapi/openai/history"
 	"ds2api/internal/httpapi/openai/shared"
 	"ds2api/internal/promptcompat"
 	"ds2api/internal/sse"
 )
 type DeepSeekCaller interface {
 	CreateSession(ctx context.Context, a *auth.RequestAuth, maxAttempts int) (string, error)
 	GetPow(ctx context.Context, a *auth.RequestAuth, maxAttempts int) (string, error)
 	UploadFile(ctx context.Context, a *auth.RequestAuth, req dsclient.UploadFileRequest, maxAttempts int) (*dsclient.UploadFileResult, error)
 	CallCompletion(ctx context.Context, a *auth.RequestAuth, payload map[string]any, powResp string, maxAttempts int) (*http.Response, error)
 }
 type Options struct {
 	StripReferenceMarkers bool
 	MaxAttempts           int
 	RetryEnabled          bool
 	RetryMaxAttempts      int
 	CurrentInputFile      history.CurrentInputConfigReader
 }
 type NonStreamResult struct {
 	SessionID string
 	Payload   map[string]any
 	Turn      assistantturn.Turn
 	Attempts  int
 }
 type StartResult struct {
 	SessionID string
 	Payload   map[string]any
 	Pow       string
 	Response  *http.Response
 	Request   promptcompat.StandardRequest
 }
 func StartCompletion(ctx context.Context, ds DeepSeekCaller, a *auth.RequestAuth, stdReq promptcompat.StandardRequest, opts Options) (StartResult, *assistantturn.OutputError) {
 	maxAttempts := opts.MaxAttempts
 	if maxAttempts <= 0 {
 		maxAttempts = 3
 	}
 	var prepErr *assistantturn.OutputError
 	stdReq, prepErr = prepareCurrentInputFile(ctx, ds, a, stdReq, opts)
 	if prepErr != nil {
 		return StartResult{Request: stdReq}, prepErr
 	}
 	sessionID, err := ds.CreateSession(ctx, a, maxAttempts)
 	if err != nil {
 		return StartResult{Request: stdReq}, authOutputError(a)
 	}
 	pow, err := ds.GetPow(ctx, a, maxAttempts)
 	if err != nil {
 		return StartResult{SessionID: sessionID, Request: stdReq}, &assistantturn.OutputError{Status: http.StatusUnauthorized, Message: "Failed to get PoW (invalid token or unknown error).", Code: "error"}
 	}
 	payload := stdReq.CompletionPayload(sessionID)
 	resp, err := ds.CallCompletion(ctx, a, payload, pow, maxAttempts)
 	if err != nil {
 		return StartResult{SessionID: sessionID, Payload: payload, Pow: pow, Request: stdReq}, &assistantturn.OutputError{Status: http.StatusInternalServerError, Message: "Failed to get completion.", Code: "error"}
 	}
 	return StartResult{SessionID: sessionID, Payload: payload, Pow: pow, Response: resp, Request: stdReq}, nil
 }
 func prepareCurrentInputFile(ctx context.Context, ds DeepSeekCaller, a *auth.RequestAuth, stdReq promptcompat.StandardRequest, opts Options) (promptcompat.StandardRequest, *assistantturn.OutputError) {
 	if opts.CurrentInputFile == nil || stdReq.CurrentInputFileApplied {
 		return stdReq, nil
 	}
 	out, err := (history.Service{Store: opts.CurrentInputFile, DS: ds}).ApplyCurrentInputFile(ctx, a, stdReq)
 	if err != nil {
 		status, message := history.MapError(err)
 		return out, &assistantturn.OutputError{Status: status, Message: message, Code: "error"}
 	}
 	return out, nil
 }
 func ExecuteNonStreamWithRetry(ctx context.Context, ds DeepSeekCaller, a *auth.RequestAuth, stdReq promptcompat.StandardRequest, opts Options) (NonStreamResult, *assistantturn.OutputError) {
 	start, startErr := StartCompletion(ctx, ds, a, stdReq, opts)
 	if startErr != nil {
 		return NonStreamResult{SessionID: start.SessionID, Payload: start.Payload}, startErr
 	}
 	stdReq = start.Request
 	maxAttempts := opts.MaxAttempts
 	if maxAttempts <= 0 {
 		maxAttempts = 3
 	}
 	sessionID := start.SessionID
 	payload := start.Payload
 	pow := start.Pow
 	attempts := 0
 	currentResp := start.Response
 	usagePrompt := stdReq.PromptTokenText
 	accumulatedThinking := ""
 	accumulatedRawThinking := ""
 	accumulatedToolDetectionThinking := ""
 	for {
 		turn, outErr := collectAttempt(currentResp, stdReq, usagePrompt, opts)
 		if outErr != nil {
 			return NonStreamResult{SessionID: sessionID, Payload: payload, Attempts: attempts}, outErr
 		}
 		accumulatedThinking += sse.TrimContinuationOverlap(accumulatedThinking, turn.Thinking)
 		accumulatedRawThinking += sse.TrimContinuationOverlap(accumulatedRawThinking, turn.RawThinking)
 		accumulatedToolDetectionThinking += sse.TrimContinuationOverlap(accumulatedToolDetectionThinking, turn.DetectionThinking)
 		turn.Thinking = accumulatedThinking
 		turn.RawThinking = accumulatedRawThinking
 		turn.DetectionThinking = accumulatedToolDetectionThinking
 		turn = assistantturn.BuildTurnFromCollected(sse.CollectResult{
 			Text:                  turn.RawText,
 			Thinking:              turn.RawThinking,
 			ToolDetectionThinking: turn.DetectionThinking,
 			ContentFilter:         turn.ContentFilter,
 			CitationLinks:         turn.CitationLinks,
 			ResponseMessageID:     turn.ResponseMessageID,
 		}, buildOptions(stdReq, usagePrompt, opts))
 		retryMax := opts.RetryMaxAttempts
 		if retryMax <= 0 {
 			retryMax = shared.EmptyOutputRetryMaxAttempts()
 		}
 		if !opts.RetryEnabled || !assistantturn.ShouldRetryEmptyOutput(turn, attempts, retryMax) {
 			return NonStreamResult{SessionID: sessionID, Payload: payload, Turn: turn, Attempts: attempts}, turn.Error
 		}
 		attempts++
 		config.Logger.Info("[completion_runtime_empty_retry] attempting synthetic retry", "surface", stdReq.Surface, "stream", false, "retry_attempt", attempts, "parent_message_id", turn.ResponseMessageID)
 		retryPow, powErr := ds.GetPow(ctx, a, maxAttempts)
 		if powErr != nil {
 			config.Logger.Warn("[completion_runtime_empty_retry] retry PoW fetch failed, falling back to original PoW", "surface", stdReq.Surface, "retry_attempt", attempts, "error", powErr)
 			retryPow = pow
 		}
 		retryPayload := shared.ClonePayloadForEmptyOutputRetry(payload, turn.ResponseMessageID)
 		nextResp, err := ds.CallCompletion(ctx, a, retryPayload, retryPow, maxAttempts)
 		if err != nil {
 			return NonStreamResult{SessionID: sessionID, Payload: payload, Turn: turn, Attempts: attempts}, &assistantturn.OutputError{Status: http.StatusInternalServerError, Message: "Failed to get completion.", Code: "error"}
 		}
 		usagePrompt = shared.UsagePromptWithEmptyOutputRetry(usagePrompt, attempts)
 		currentResp = nextResp
 	}
 }
 func collectAttempt(resp *http.Response, stdReq promptcompat.StandardRequest, usagePrompt string, opts Options) (assistantturn.Turn, *assistantturn.OutputError) {
 	defer func() {
 		if err := resp.Body.Close(); err != nil {
 			config.Logger.Warn("[completion_runtime] response body close failed", "surface", stdReq.Surface, "error", err)
 		}
 	}()
 	if resp.StatusCode != http.StatusOK {
 		body, _ := io.ReadAll(resp.Body)
 		message := strings.TrimSpace(string(body))
 		if message == "" {
 			message = http.StatusText(resp.StatusCode)
 		}
 		return assistantturn.Turn{}, &assistantturn.OutputError{Status: resp.StatusCode, Message: message, Code: "error"}
 	}
 	result := sse.CollectStream(resp, stdReq.Thinking, false)
 	return assistantturn.BuildTurnFromCollected(result, buildOptions(stdReq, usagePrompt, opts)), nil
 }
 func buildOptions(stdReq promptcompat.StandardRequest, prompt string, opts Options) assistantturn.BuildOptions {
 	return assistantturn.BuildOptions{
 		Model:                 stdReq.ResponseModel,
 		Prompt:                prompt,
 		RefFileTokens:         stdReq.RefFileTokens,
 		SearchEnabled:         stdReq.Search,
 		StripReferenceMarkers: opts.StripReferenceMarkers,
 		ToolNames:             stdReq.ToolNames,
 		ToolsRaw:              stdReq.ToolsRaw,
 		ToolChoice:            stdReq.ToolChoice,
 	}
 }
 func authOutputError(a *auth.RequestAuth) *assistantturn.OutputError {
 	if a != nil && a.UseConfigToken {
 		return &assistantturn.OutputError{Status: http.StatusUnauthorized, Message: "Account token is invalid. Please re-login the account in admin.", Code: "error"}
 	}
 	return &assistantturn.OutputError{Status: http.StatusUnauthorized, Message: "Invalid token. If this should be a DS2API key, add it to config.keys first.", Code: "error"}
 }
 func Errorf(status int, format string, args ...any) *assistantturn.OutputError {
 	return &assistantturn.OutputError{Status: status, Message: fmt.Sprintf(format, args...), Code: "error"}
 }
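`ExecuteNonStreamWithRetry` accumulates thinking text across retry attempts by appending only the part of each attempt that `sse.TrimContinuationOverlap` returns. That function's implementation is not part of this diff, so the sketch below is an assumption about the general technique: drop the longest prefix of the new text that already appears as a suffix of the accumulated text. `trimOverlap` and `accumulate` are hypothetical names.

```go
package main

import "strings"

// trimOverlap is a hypothetical stand-in for sse.TrimContinuationOverlap.
// It returns next with its longest prefix removed that is also a suffix of
// prev, so repeated text at the seam between two attempts is not duplicated.
func trimOverlap(prev, next string) string {
	limit := len(prev)
	if len(next) < limit {
		limit = len(next)
	}
	// Try the longest possible overlap first and shrink until one matches.
	for n := limit; n > 0; n-- {
		if strings.HasSuffix(prev, next[:n]) {
			return next[n:]
		}
	}
	return next
}

// accumulate mirrors the loop's pattern of
// acc += trimOverlap(acc, attemptText) over successive attempts.
func accumulate(parts ...string) string {
	acc := ""
	for _, p := range parts {
		acc += trimOverlap(acc, p)
	}
	return acc
}
```

For example, `accumulate("let me ", "me think")` yields `"let me think"` rather than `"let me me think"`.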
--- a/internal/completionruntime/nonstream_test.go
+++ b/internal/completionruntime/nonstream_test.go
@@ -0,0 +1,197 @@
 package completionruntime
 import (
 	"context"
 	"io"
 	"net/http"
 	"strings"
 	"testing"
 	"ds2api/internal/auth"
 	dsclient "ds2api/internal/deepseek/client"
 	"ds2api/internal/promptcompat"
 )
 type fakeDeepSeekCaller struct {
 	responses []*http.Response
 	payloads  []map[string]any
 	uploads   []dsclient.UploadFileRequest
 }
 type currentInputRuntimeConfig struct{}
 func (currentInputRuntimeConfig) CurrentInputFileEnabled() bool { return true }
 func (currentInputRuntimeConfig) CurrentInputFileMinChars() int { return 0 }
 func (f *fakeDeepSeekCaller) CreateSession(context.Context, *auth.RequestAuth, int) (string, error) {
 	return "session-1", nil
 }
 func (f *fakeDeepSeekCaller) GetPow(context.Context, *auth.RequestAuth, int) (string, error) {
 	return "pow", nil
 }
 func (f *fakeDeepSeekCaller) UploadFile(_ context.Context, _ *auth.RequestAuth, req dsclient.UploadFileRequest, _ int) (*dsclient.UploadFileResult, error) {
 	f.uploads = append(f.uploads, req)
 	return &dsclient.UploadFileResult{ID: "file-runtime-1"}, nil
 }
 func (f *fakeDeepSeekCaller) CallCompletion(_ context.Context, _ *auth.RequestAuth, payload map[string]any, _ string, _ int) (*http.Response, error) {
 	f.payloads = append(f.payloads, payload)
 	if len(f.responses) == 0 {
 		return sseHTTPResponse(http.StatusOK, `data: {"p":"response/content","v":"fallback"}`), nil
 	}
 	resp := f.responses[0]
 	f.responses = f.responses[1:]
 	return resp, nil
 }
 func TestExecuteNonStreamWithRetryBuildsCanonicalTurn(t *testing.T) {
 	ds := &fakeDeepSeekCaller{responses: []*http.Response{sseHTTPResponse(
 		http.StatusOK,
 		`data: {"response_message_id":42,"p":"response/content","v":"<tool_calls><invoke name=\"Write\"><parameter name=\"content\">{\"x\":1}</parameter></invoke></tool_calls>"}`,
 	)}}
 	stdReq := promptcompat.StandardRequest{
 		Surface:         "test",
 		ResponseModel:   "deepseek-v4-flash",
 		PromptTokenText: "prompt",
 		FinalPrompt:     "final prompt",
 		ToolNames:       []string{"Write"},
 		ToolsRaw: []any{map[string]any{
 			"name": "Write",
 			"input_schema": map[string]any{
 				"type": "object",
 				"properties": map[string]any{
 					"content": map[string]any{"type": "string"},
 				},
 			},
 		}},
 	}
 	result, outErr := ExecuteNonStreamWithRetry(context.Background(), ds, &auth.RequestAuth{}, stdReq, Options{})
 	if outErr != nil {
 		t.Fatalf("unexpected output error: %#v", outErr)
 	}
 	if result.SessionID != "session-1" {
 		t.Fatalf("session mismatch: %q", result.SessionID)
 	}
 	if got := result.Turn.ResponseMessageID; got != 42 {
 		t.Fatalf("response message id mismatch: %d", got)
 	}
 	if len(result.Turn.ToolCalls) != 1 {
 		t.Fatalf("expected one tool call, got %d", len(result.Turn.ToolCalls))
 	}
 	if _, ok := result.Turn.ToolCalls[0].Input["content"].(string); !ok {
 		t.Fatalf("expected schema-normalized string argument, got %#v", result.Turn.ToolCalls[0].Input["content"])
 	}
 	if result.Turn.Usage.InputTokens == 0 || result.Turn.Usage.TotalTokens == 0 {
 		t.Fatalf("expected usage to be populated, got %#v", result.Turn.Usage)
 	}
 }
 func TestExecuteNonStreamWithRetryUsesParentMessageForEmptyRetry(t *testing.T) {
 	ds := &fakeDeepSeekCaller{responses: []*http.Response{
 		sseHTTPResponse(http.StatusOK, `data: {"response_message_id":77,"p":"response/status","v":"FINISHED"}`),
 		sseHTTPResponse(http.StatusOK, `data: {"response_message_id":78,"p":"response/content","v":"ok"}`),
 	}}
 	stdReq := promptcompat.StandardRequest{
 		Surface:         "test",
 		ResponseModel:   "deepseek-v4-flash",
 		PromptTokenText: "prompt",
 		FinalPrompt:     "final prompt",
 	}
 	result, outErr := ExecuteNonStreamWithRetry(context.Background(), ds, &auth.RequestAuth{}, stdReq, Options{RetryEnabled: true})
 	if outErr != nil {
 		t.Fatalf("unexpected output error: %#v", outErr)
 	}
 	if result.Attempts != 1 {
 		t.Fatalf("expected one retry, got %d", result.Attempts)
 	}
 	if len(ds.payloads) != 2 {
 		t.Fatalf("expected two completion calls, got %d", len(ds.payloads))
 	}
 	if got := ds.payloads[1]["parent_message_id"]; got != 77 {
 		t.Fatalf("retry parent_message_id mismatch: %#v", got)
 	}
 	if result.Turn.Text != "ok" {
 		t.Fatalf("retry text mismatch: %q", result.Turn.Text)
 	}
 }
 func TestExecuteNonStreamWithRetryConvertsReferenceMarkers(t *testing.T) {
 	ds := &fakeDeepSeekCaller{responses: []*http.Response{sseHTTPResponse(
 		http.StatusOK,
 		`data: {"p":"response/content","v":"答案[reference:0]。","citation":{"cite_index":0,"url":"https://example.com/ref"}}`,
 	)}}
 	stdReq := promptcompat.StandardRequest{
 		Surface:         "test",
 		ResponseModel:   "deepseek-v4-flash-search",
 		PromptTokenText: "prompt",
 		FinalPrompt:     "final prompt",
 		Search:          true,
 	}
 	result, outErr := ExecuteNonStreamWithRetry(context.Background(), ds, &auth.RequestAuth{}, stdReq, Options{})
 	if outErr != nil {
 		t.Fatalf("unexpected output error: %#v", outErr)
 	}
 	want := "答案[0](https://example.com/ref)。"
 	if result.Turn.Text != want {
 		t.Fatalf("text mismatch: got %q want %q", result.Turn.Text, want)
 	}
 }
 func TestStartCompletionAppliesCurrentInputFileGlobally(t *testing.T) {
 	ds := &fakeDeepSeekCaller{responses: []*http.Response{sseHTTPResponse(http.StatusOK, `data: {"p":"response/content","v":"ok"}`)}}
 	stdReq := promptcompat.StandardRequest{
 		Surface:         "test_adapter",
 		RequestedModel:  "deepseek-v4-flash",
 		ResolvedModel:   "deepseek-v4-flash",
 		ResponseModel:   "deepseek-v4-flash",
 		PromptTokenText: "first user turn",
 		FinalPrompt:     "first user turn",
 		Messages: []any{
 			map[string]any{"role": "user", "content": "first user turn"},
 		},
 	}
 	start, outErr := StartCompletion(context.Background(), ds, &auth.RequestAuth{DeepSeekToken: "token"}, stdReq, Options{
 		CurrentInputFile: currentInputRuntimeConfig{},
 	})
 	if outErr != nil {
 		t.Fatalf("unexpected output error: %#v", outErr)
 	}
 	if len(ds.uploads) != 1 {
 		t.Fatalf("expected current input upload, got %d", len(ds.uploads))
 	}
 	if got := ds.uploads[0].Filename; got != "DS2API_HISTORY.txt" {
 		t.Fatalf("upload filename=%q want DS2API_HISTORY.txt", got)
 	}
 	if len(ds.payloads) != 1 {
 		t.Fatalf("expected one completion payload, got %d", len(ds.payloads))
 	}
 	refIDs, _ := ds.payloads[0]["ref_file_ids"].([]any)
 	if len(refIDs) != 1 || refIDs[0] != "file-runtime-1" {
 		t.Fatalf("expected uploaded file id in ref_file_ids, got %#v", ds.payloads[0]["ref_file_ids"])
 	}
 	prompt, _ := ds.payloads[0]["prompt"].(string)
 	if !strings.Contains(prompt, "Continue from the latest state in the attached DS2API_HISTORY.txt context.") {
 		t.Fatalf("expected continuation prompt, got %q", prompt)
 	}
 	if !start.Request.CurrentInputFileApplied || !strings.Contains(start.Request.PromptTokenText, "# DS2API_HISTORY.txt") {
 		t.Fatalf("expected prepared request to carry current input file state, got %#v", start.Request)
 	}
 }
 func sseHTTPResponse(status int, lines ...string) *http.Response {
 	body := strings.Join(lines, "\n")
 	if !strings.HasSuffix(body, "\n") {
 		body += "\n"
 	}
 	return &http.Response{
 		StatusCode: status,
 		Header:     make(http.Header),
 		Body:       io.NopCloser(strings.NewReader(body)),
 	}
 }
--- a/internal/config/codec.go
+++ b/internal/config/codec.go
@@ -35,9 +35,6 @@ func (c Config) MarshalJSON() ([]byte, error) {
 	if c.Runtime.AccountMaxInflight > 0 || c.Runtime.AccountMaxQueue > 0 || c.Runtime.GlobalMaxInflight > 0 || c.Runtime.TokenRefreshIntervalHours > 0 {
 		m["runtime"] = c.Runtime
 	}
-	if c.Compat.WideInputStrictOutput != nil || c.Compat.StripReferenceMarkers != nil {
-		m["compat"] = c.Compat
-	}
 	if c.Responses.StoreTTLSeconds > 0 {
 		m["responses"] = c.Responses
 	}
@@ -45,9 +42,6 @@ func (c Config) MarshalJSON() ([]byte, error) {
 		m["embeddings"] = c.Embeddings
 	}
 	m["auto_delete"] = c.AutoDelete
-	if c.HistorySplit.Enabled != nil || c.HistorySplit.TriggerAfterTurns != nil {
-		m["history_split"] = c.HistorySplit
-	}
 	if c.CurrentInputFile.Enabled != nil || c.CurrentInputFile.MinChars != 0 {
 		m["current_input_file"] = c.CurrentInputFile
 	}
@@ -103,8 +97,9 @@ func (c *Config) UnmarshalJSON(b []byte) error {
 				return fmt.Errorf("invalid field %q: %w", k, err)
 			}
 		case "compat":
-			if err := json.Unmarshal(v, &c.Compat); err != nil {
-				return fmt.Errorf("invalid field %q: %w", k, err)
+			// Removed field ignored instead of persisted.
+			if Logger != nil {
+				Logger.Warn("config key \"compat\" is deprecated and ignored; remove it from your configuration")
 			}
 		case "toolcall":
 			// Legacy field ignored. Toolcall policy is fixed and no longer configurable.
@@ -121,9 +116,7 @@ func (c *Config) UnmarshalJSON(b []byte) error {
 				return fmt.Errorf("invalid field %q: %w", k, err)
 			}
 		case "history_split":
-			if err := json.Unmarshal(v, &c.HistorySplit); err != nil {
-				return fmt.Errorf("invalid field %q: %w", k, err)
-			}
+			// Removed legacy split field is ignored instead of persisted.
 		case "current_input_file":
 			if err := json.Unmarshal(v, &c.CurrentInputFile); err != nil {
 				return fmt.Errorf("invalid field %q: %w", k, err)
@@ -160,17 +153,9 @@ func (c Config) Clone() Config {
 		ModelAliases: cloneStringMap(c.ModelAliases),
 		Admin:        c.Admin,
 		Runtime:      c.Runtime,
-		Compat: CompatConfig{
-			WideInputStrictOutput: cloneBoolPtr(c.Compat.WideInputStrictOutput),
-			StripReferenceMarkers: cloneBoolPtr(c.Compat.StripReferenceMarkers),
-		},
-		Responses:  c.Responses,
-		Embeddings: c.Embeddings,
-		AutoDelete: c.AutoDelete,
-		HistorySplit: HistorySplitConfig{
-			Enabled:           cloneBoolPtr(c.HistorySplit.Enabled),
-			TriggerAfterTurns: cloneIntPtr(c.HistorySplit.TriggerAfterTurns),
-		},
+		Responses:    c.Responses,
+		Embeddings:   c.Embeddings,
+		AutoDelete:   c.AutoDelete,
 		CurrentInputFile: CurrentInputFileConfig{
 			Enabled:  cloneBoolPtr(c.CurrentInputFile.Enabled),
 			MinChars: c.CurrentInputFile.MinChars,
@@ -208,14 +193,6 @@ func cloneBoolPtr(in *bool) *bool {
 	return &v
 }
-func cloneIntPtr(in *int) *int {
-	if in == nil {
-		return nil
-	}
-	v := *in
-	return &v
-}
 func parseConfigString(raw string) (Config, error) {
 	var cfg Config
 	candidates := []string{raw}
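The `codec.go` hunks above change `compat` and `history_split` from decoded fields into keys that are accepted on input but never stored or written back. A self-contained sketch of that pattern follows. This `Config` is a hypothetical two-field reduction, and `removedKeys` and `roundTrip` are illustrative helpers that do not exist in the repository; the real codec handles many more keys and logs a warning for `compat`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Config is a hypothetical reduction of the real struct: one known field
// plus a catch-all map for unknown keys.
type Config struct {
	Keys             []string
	AdditionalFields map[string]any
}

// removedKeys lists config keys that are tolerated on input but dropped.
var removedKeys = map[string]bool{"compat": true, "history_split": true}

// UnmarshalJSON decodes known keys, silently drops removed keys, and keeps
// every other key in AdditionalFields.
func (c *Config) UnmarshalJSON(b []byte) error {
	raw := map[string]json.RawMessage{}
	if err := json.Unmarshal(b, &raw); err != nil {
		return err
	}
	c.Keys = nil
	c.AdditionalFields = map[string]any{}
	for k, v := range raw {
		switch {
		case k == "keys":
			if err := json.Unmarshal(v, &c.Keys); err != nil {
				return fmt.Errorf("invalid field %q: %w", k, err)
			}
		case removedKeys[k]:
			// Removed field: neither decoded nor kept as an additional field.
		default:
			var val any
			if err := json.Unmarshal(v, &val); err != nil {
				return fmt.Errorf("invalid field %q: %w", k, err)
			}
			c.AdditionalFields[k] = val
		}
	}
	return nil
}

// MarshalJSON writes the known field and the additional fields only, so a
// removed key cannot reappear in saved output.
func (c Config) MarshalJSON() ([]byte, error) {
	m := map[string]any{}
	for k, v := range c.AdditionalFields {
		m[k] = v
	}
	if len(c.Keys) > 0 {
		m["keys"] = c.Keys
	}
	return json.Marshal(m)
}

// roundTrip decodes raw into a Config and encodes it again.
func roundTrip(raw string) (string, error) {
	var cfg Config
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		return "", err
	}
	out, err := json.Marshal(cfg)
	return string(out), err
}
```

Because the removed keys bypass `AdditionalFields`, loading an old configuration and saving it once is enough to strip them from the file, which is what `TestConfigUnmarshalJSONIgnoresRemovedHistorySplit` checks for the real codec.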
--- a/internal/config/config.go
+++ b/internal/config/config.go
@@ -15,11 +15,9 @@ type Config struct {
 	ModelAliases      map[string]string       `json:"model_aliases,omitempty"`
 	Admin             AdminConfig             `json:"admin,omitempty"`
 	Runtime           RuntimeConfig           `json:"runtime,omitempty"`
-	Compat            CompatConfig            `json:"compat,omitempty"`
 	Responses         ResponsesConfig         `json:"responses,omitempty"`
 	Embeddings        EmbeddingsConfig        `json:"embeddings,omitempty"`
 	AutoDelete        AutoDeleteConfig        `json:"auto_delete"`
-	HistorySplit      HistorySplitConfig      `json:"history_split"`
 	CurrentInputFile  CurrentInputFileConfig  `json:"current_input_file,omitempty"`
 	ThinkingInjection ThinkingInjectionConfig `json:"thinking_injection,omitempty"`
 	VercelSyncHash    string                  `json:"_vercel_sync_hash,omitempty"`
@@ -142,11 +140,6 @@ func (c *Config) normalizeModelAliases() {
 	}
 }
-type CompatConfig struct {
-	WideInputStrictOutput *bool `json:"wide_input_strict_output,omitempty"`
-	StripReferenceMarkers *bool `json:"strip_reference_markers,omitempty"`
-}
 type AdminConfig struct {
 	PasswordHash      string `json:"password_hash,omitempty"`
 	JWTExpireHours    int    `json:"jwt_expire_hours,omitempty"`
@@ -173,11 +166,6 @@ type AutoDeleteConfig struct {
 	Sessions bool   `json:"sessions,omitempty"`
 }
-type HistorySplitConfig struct {
-	Enabled           *bool `json:"enabled,omitempty"`
-	TriggerAfterTurns *int  `json:"trigger_after_turns,omitempty"`
-}
 type CurrentInputFileConfig struct {
 	Enabled  *bool `json:"enabled,omitempty"`
 	MinChars int   `json:"min_chars,omitempty"`
--- a/internal/config/config_edge_test.go
+++ b/internal/config/config_edge_test.go
@@ -79,13 +79,20 @@ func TestGetModelConfigDeepSeekExpertReasonerSearch(t *testing.T) {
 	}
 }
-func TestGetModelConfigDeepSeekVisionReasonerSearch(t *testing.T) {
+func TestGetModelConfigDeepSeekVision(t *testing.T) {
-	thinking, search, ok := GetModelConfig("deepseek-v4-vision-search")
+	thinking, search, ok := GetModelConfig("deepseek-v4-vision")
 	if !ok {
-		t.Fatal("expected ok for deepseek-v4-vision-search")
+		t.Fatal("expected ok for deepseek-v4-vision")
 	}
-	if !thinking || !search {
+	if !thinking || search {
-		t.Fatalf("expected both true, got thinking=%v search=%v", thinking, search)
+		t.Fatalf("expected thinking=true search=false, got thinking=%v search=%v", thinking, search)
 	}
 }
+func TestGetModelConfigDeepSeekVisionSearchUnsupported(t *testing.T) {
+	_, _, ok := GetModelConfig("deepseek-v4-vision-search")
+	if ok {
+		t.Fatal("expected deepseek-v4-vision-search to be unsupported")
+	}
+}
@@ -156,8 +163,6 @@ func TestLowerFunction(t *testing.T) {
 // ─── Config.MarshalJSON / UnmarshalJSON roundtrip ────────────────────
 func TestConfigJSONRoundtrip(t *testing.T) {
-	trueVal := true
-	falseVal := false
 	cfg := Config{
 		Keys:         []string{"key1", "key2"},
 		Accounts:     []Account{{Email: "user@example.com", Password: "pass", Token: "tok"}},
@@ -165,17 +170,9 @@ func TestConfigJSONRoundtrip(t *testing.T) {
 		AutoDelete: AutoDeleteConfig{
 			Mode: "single",
 		},
-		HistorySplit: HistorySplitConfig{
-			Enabled:           &trueVal,
-			TriggerAfterTurns: func() *int { v := 2; return &v }(),
-		},
 		Runtime: RuntimeConfig{
 			TokenRefreshIntervalHours: 12,
 		},
-		Compat: CompatConfig{
-			WideInputStrictOutput: &trueVal,
-			StripReferenceMarkers: &falseVal,
-		},
 		VercelSyncHash: "hash123",
 		VercelSyncTime: 1234567890,
 		AdditionalFields: map[string]any{
@@ -208,18 +205,6 @@ func TestConfigJSONRoundtrip(t *testing.T) {
 	if decoded.AutoDelete.Mode != "single" {
 		t.Fatalf("unexpected auto delete mode: %#v", decoded.AutoDelete.Mode)
 	}
-	if decoded.HistorySplit.Enabled == nil || !*decoded.HistorySplit.Enabled {
-		t.Fatalf("unexpected history split enabled: %#v", decoded.HistorySplit.Enabled)
-	}
-	if decoded.HistorySplit.TriggerAfterTurns == nil || *decoded.HistorySplit.TriggerAfterTurns != 2 {
-		t.Fatalf("unexpected history split trigger_after_turns: %#v", decoded.HistorySplit.TriggerAfterTurns)
-	}
-	if decoded.Compat.WideInputStrictOutput == nil || !*decoded.Compat.WideInputStrictOutput {
-		t.Fatalf("unexpected compat wide_input_strict_output: %#v", decoded.Compat.WideInputStrictOutput)
-	}
-	if decoded.Compat.StripReferenceMarkers == nil || *decoded.Compat.StripReferenceMarkers {
-		t.Fatalf("unexpected compat strip_reference_markers: %#v", decoded.Compat.StripReferenceMarkers)
-	}
 	if decoded.VercelSyncHash != "hash123" {
 		t.Fatalf("unexpected vercel sync hash: %q", decoded.VercelSyncHash)
 	}
@@ -283,23 +268,31 @@ func TestConfigUnmarshalJSONIgnoresRemovedLegacyModelMappings(t *testing.T) {
 	}
 }
+func TestConfigUnmarshalJSONIgnoresRemovedHistorySplit(t *testing.T) {
+	raw := `{"keys":["k1"],"history_split":{"enabled":true,"trigger_after_turns":2}}`
+	var cfg Config
+	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
+		t.Fatalf("unmarshal error: %v", err)
+	}
+	if _, ok := cfg.AdditionalFields["history_split"]; ok {
+		t.Fatalf("expected removed legacy field not to persist in additional fields: %#v", cfg.AdditionalFields)
+	}
+	out, err := json.Marshal(cfg)
+	if err != nil {
+		t.Fatalf("marshal error: %v", err)
+	}
+	if strings.Contains(string(out), "history_split") {
+		t.Fatalf("expected removed history_split field not to marshal, got %s", out)
+	}
+}
 // ─── Config.Clone ────────────────────────────────────────────────────
 func TestConfigCloneIsDeepCopy(t *testing.T) {
-	falseVal := false
-	trueVal := true
-	turns := 2
 	cfg := Config{
-		Keys:         []string{"key1"},
+		Keys:             []string{"key1"},
-		Accounts:     []Account{{Email: "user@test.com", Token: "token"}},
+		Accounts:         []Account{{Email: "user@test.com", Token: "token"}},
-		ModelAliases: map[string]string{"claude-sonnet-4-6": "deepseek-v4-flash"},
+		ModelAliases:     map[string]string{"claude-sonnet-4-6": "deepseek-v4-flash"},
-		Compat: CompatConfig{
-			StripReferenceMarkers: &falseVal,
-		},
-		HistorySplit: HistorySplitConfig{
-			Enabled:           &trueVal,
-			TriggerAfterTurns: &turns,
-		},
 		AdditionalFields: map[string]any{"custom": "value"},
 	}
@@ -309,15 +302,6 @@ func TestConfigCloneIsDeepCopy(t *testing.T) {
 	cfg.Keys[0] = "modified"
 	cfg.Accounts[0].Email = "modified@test.com"
 	cfg.ModelAliases["claude-sonnet-4-6"] = "modified-model"
-	if cfg.Compat.StripReferenceMarkers != nil {
-		*cfg.Compat.StripReferenceMarkers = true
-	}
-	if cfg.HistorySplit.Enabled != nil {
-		*cfg.HistorySplit.Enabled = false
-	}
-	if cfg.HistorySplit.TriggerAfterTurns != nil {
-		*cfg.HistorySplit.TriggerAfterTurns = 5
-	}
 	// Cloned should not be affected
 	if cloned.Keys[0] != "key1" {
@@ -329,15 +313,6 @@ func TestConfigCloneIsDeepCopy(t *testing.T) {
 	if cloned.ModelAliases["claude-sonnet-4-6"] != "deepseek-v4-flash" {
 		t.Fatalf("clone model aliases was affected: %#v", cloned.ModelAliases)
 	}
-	if cloned.Compat.StripReferenceMarkers == nil || *cloned.Compat.StripReferenceMarkers {
-		t.Fatalf("clone compat was affected: %#v", cloned.Compat.StripReferenceMarkers)
-	}
-	if cloned.HistorySplit.Enabled == nil || !*cloned.HistorySplit.Enabled {
-		t.Fatalf("clone history split enabled was affected: %#v", cloned.HistorySplit.Enabled)
-	}
-	if cloned.HistorySplit.TriggerAfterTurns == nil || *cloned.HistorySplit.TriggerAfterTurns != 2 {
-		t.Fatalf("clone history split trigger was affected: %#v", cloned.HistorySplit.TriggerAfterTurns)
-	}
 }
 func TestConfigCloneNilMaps(t *testing.T) {
@@ -476,53 +451,9 @@ func TestStoreFindAccountNotFound(t *testing.T) {
 	}
 }
-func TestStoreCompatWideInputStrictOutputDefaultTrue(t *testing.T) {
+func TestStoreIgnoresRemovedCompatConfig(t *testing.T) {
 	t.Setenv("DS2API_CONFIG_JSON", `{"keys":["k1"],"accounts":[]}`)
 	store := LoadStore()
 	if !store.CompatWideInputStrictOutput() {
 		t.Fatal("expected default wide_input_strict_output=true when unset")
 	}
 }
 func TestStoreCompatWideInputStrictOutputCanDisable(t *testing.T) {
 	t.Setenv("DS2API_CONFIG_JSON", `{"keys":["k1"],"accounts":[],"compat":{"wide_input_strict_output":false}}`)
 	store := LoadStore()
 	if store.CompatWideInputStrictOutput() {
 		t.Fatal("expected wide_input_strict_output=false when explicitly configured")
 	}
 	snap := store.Snapshot()
 	data, err := snap.MarshalJSON()
 	if err != nil {
 		t.Fatalf("marshal failed: %v", err)
 	}
 	var out map[string]any
 	if err := json.Unmarshal(data, &out); err != nil {
 		t.Fatalf("decode failed: %v", err)
 	}
 	rawCompat, ok := out["compat"].(map[string]any)
 	if !ok {
 		t.Fatalf("expected compat in marshaled output, got %#v", out)
 	}
 	if rawCompat["wide_input_strict_output"] != false {
 		t.Fatalf("expected explicit false in compat, got %#v", rawCompat)
 	}
 }
 func TestStoreCompatStripReferenceMarkersDefaultTrue(t *testing.T) {
 	t.Setenv("DS2API_CONFIG_JSON", `{"keys":["k1"],"accounts":[]}`)
 	store := LoadStore()
 	if !store.CompatStripReferenceMarkers() {
 		t.Fatal("expected default strip_reference_markers=true when unset")
 	}
 }
 func TestStoreCompatStripReferenceMarkersCanDisable(t *testing.T) {
 	t.Setenv("DS2API_CONFIG_JSON", `{"keys":["k1"],"accounts":[],"compat":{"strip_reference_markers":false}}`)
 	store := LoadStore()
 	if store.CompatStripReferenceMarkers() {
 		t.Fatal("expected strip_reference_markers=false when explicitly configured")
 	}
 	snap := store.Snapshot()
 	data, err := snap.MarshalJSON()
@@ -533,12 +464,8 @@ func TestStoreCompatStripReferenceMarkersCanDisable(t *testing.T) {
 	if err := json.Unmarshal(data, &out); err != nil {
 		t.Fatalf("decode failed: %v", err)
 	}
-	rawCompat, ok := out["compat"].(map[string]any)
-	if !ok {
-		t.Fatalf("expected compat in marshaled output, got %#v", out)
-	}
-	if rawCompat["strip_reference_markers"] != false {
-		t.Fatalf("expected explicit false in compat, got %#v", rawCompat)
+	if _, ok := out["compat"]; ok {
+		t.Fatalf("expected removed compat field not to marshal, got %#v", out)
 	}
 }
@@ -748,18 +675,16 @@ func TestOpenAIModelsResponse(t *testing.T) {
 		t.Fatal("expected non-empty models list")
 	}
 	expected := map[string]bool{
-		"deepseek-v4-flash":                    false,
-		"deepseek-v4-flash-nothinking":         false,
-		"deepseek-v4-pro":                      false,
-		"deepseek-v4-pro-nothinking":           false,
-		"deepseek-v4-flash-search":             false,
-		"deepseek-v4-flash-search-nothinking":  false,
-		"deepseek-v4-pro-search":               false,
-		"deepseek-v4-pro-search-nothinking":    false,
-		"deepseek-v4-vision":                   false,
-		"deepseek-v4-vision-nothinking":        false,
-		"deepseek-v4-vision-search":            false,
-		"deepseek-v4-vision-search-nothinking": false,
+		"deepseek-v4-flash":                   false,
+		"deepseek-v4-flash-nothinking":        false,
+		"deepseek-v4-pro":                     false,
+		"deepseek-v4-pro-nothinking":          false,
+		"deepseek-v4-flash-search":            false,
+		"deepseek-v4-flash-search-nothinking": false,
+		"deepseek-v4-pro-search":              false,
+		"deepseek-v4-pro-search-nothinking":   false,
+		"deepseek-v4-vision":                  false,
+		"deepseek-v4-vision-nothinking":       false,
 	}
 	for _, model := range data {
 		if _, ok := expected[model.ID]; ok {
--- a/internal/config/config_test.go
+++ b/internal/config/config_test.go
@@ -144,6 +144,44 @@ func TestLoadStoreIgnoresLegacyConfigJSONEnv(t *testing.T) {
 	}
 }
 func TestExplicitMissingConfigPathBootstrapsEmptyFileBackedStore(t *testing.T) {
 	path := t.TempDir() + "/config.json"
 	t.Setenv("DS2API_CONFIG_JSON", "")
 	t.Setenv("DS2API_CONFIG_PATH", path)
 	store, err := LoadStoreWithError()
 	if err != nil {
 		t.Fatalf("expected missing explicit config path to bootstrap, got: %v", err)
 	}
 	if store.IsEnvBacked() {
 		t.Fatal("expected bootstrap store to be file-backed")
 	}
 	if store.ConfigPath() != path {
 		t.Fatalf("ConfigPath() = %q, want %q", store.ConfigPath(), path)
 	}
 	if len(store.Keys()) != 0 || len(store.Accounts()) != 0 {
 		t.Fatalf("expected empty bootstrap config, got keys=%d accounts=%d", len(store.Keys()), len(store.Accounts()))
 	}
 	if _, statErr := os.Stat(path); !errors.Is(statErr, os.ErrNotExist) {
 		t.Fatalf("expected bootstrap not to create config until first save, stat err=%v", statErr)
 	}
 	if err := store.Update(func(c *Config) error {
 		c.Keys = []string{"first-key"}
 		return nil
 	}); err != nil {
 		t.Fatalf("update should persist bootstrap config: %v", err)
 	}
 	content, err := os.ReadFile(path)
 	if err != nil {
 		t.Fatalf("expected first update to write config: %v", err)
 	}
 	if !strings.Contains(string(content), "first-key") {
 		t.Fatalf("expected saved config to contain first key, got: %s", content)
 	}
 }
 func TestEnvBackedStoreWritebackBootstrapsMissingConfigFile(t *testing.T) {
 	tmp, err := os.CreateTemp(t.TempDir(), "config-*.json")
 	if err != nil {
--- a/internal/config/model_alias_test.go
+++ b/internal/config/model_alias_test.go
@@ -144,10 +144,17 @@ func TestResolveModelCustomAliasToExpert(t *testing.T) {
 func TestResolveModelCustomAliasToVision(t *testing.T) {
 	got, ok := ResolveModel(mockModelAliasReader{
-		"my-vision-model": "deepseek-v4-vision-search",
+		"my-vision-model": "deepseek-v4-vision",
 	}, "my-vision-model")
-	if !ok || got != "deepseek-v4-vision-search" {
+	if !ok || got != "deepseek-v4-vision" {
-		t.Fatalf("expected alias -> deepseek-v4-vision-search, got ok=%v model=%q", ok, got)
+		t.Fatalf("expected alias -> deepseek-v4-vision, got ok=%v model=%q", ok, got)
 	}
 }
+func TestResolveModelHeuristicVisionIgnoresSearchSuffix(t *testing.T) {
+	got, ok := ResolveModel(nil, "gemini-vision-search")
+	if !ok || got != "deepseek-v4-vision" {
+		t.Fatalf("expected heuristic vision alias to resolve without search variant, got ok=%v model=%q", ok, got)
+	}
+}
--- a/internal/config/models.go
+++ b/internal/config/models.go
@@ -22,7 +22,6 @@ var deepSeekBaseModels = []ModelInfo{
 	{ID: "deepseek-v4-flash-search", Object: "model", Created: 1677610602, OwnedBy: "deepseek", Permission: []any{}},
 	{ID: "deepseek-v4-pro-search", Object: "model", Created: 1677610602, OwnedBy: "deepseek", Permission: []any{}},
 	{ID: "deepseek-v4-vision", Object: "model", Created: 1677610602, OwnedBy: "deepseek", Permission: []any{}},
-	{ID: "deepseek-v4-vision-search", Object: "model", Created: 1677610602, OwnedBy: "deepseek", Permission: []any{}},
 }
 var DeepSeekModels = appendNoThinkingVariants(deepSeekBaseModels)
@@ -67,7 +66,7 @@ func GetModelConfig(model string) (thinking bool, search bool, ok bool) {
 	switch baseModel {
 	case "deepseek-v4-flash", "deepseek-v4-pro", "deepseek-v4-vision":
 		return !noThinking, false, true
-	case "deepseek-v4-flash-search", "deepseek-v4-pro-search", "deepseek-v4-vision-search":
+	case "deepseek-v4-flash-search", "deepseek-v4-pro-search":
 		return !noThinking, true, true
 	default:
 		return false, false, false
@@ -81,7 +80,7 @@ func GetModelType(model string) (modelType string, ok bool) {
 		return "default", true
 	case "deepseek-v4-pro", "deepseek-v4-pro-search":
 		return "expert", true
-	case "deepseek-v4-vision", "deepseek-v4-vision-search":
+	case "deepseek-v4-vision":
 		return "vision", true
 	default:
 		return "", false
@@ -359,8 +358,6 @@ func resolveCanonicalModel(aliases map[string]string, model string) (string, boo
 	useSearch := strings.Contains(model, "search")
 	switch {
-	case useVision && useSearch:
-		return "deepseek-v4-vision-search", true
 	case useVision:
 		return "deepseek-v4-vision", true
 	case useReasoner && useSearch:
--- a/internal/config/paths.go
+++ b/internal/config/paths.go
@@ -30,9 +30,29 @@ func ResolvePath(envKey, defaultRel string) string {
 }
 func ConfigPath() string {
 	if strings.TrimSpace(os.Getenv("DS2API_CONFIG_PATH")) == "" && BaseDir() == "/app" {
 		return containerDefaultConfigPath()
 	}
 	return ResolvePath("DS2API_CONFIG_PATH", "config.json")
 }
 func containerDefaultConfigPath() string {
 	// Container images run as non-root by default. Only use /data when mounted/provisioned.
 	// Otherwise keep /app/config.json so admin-side save does not fail on MkdirAll("/data").
 	if st, err := os.Stat("/data"); err == nil && st.IsDir() {
 		return "/data/config.json"
 	}
 	return "/app/config.json"
 }
 func legacyContainerConfigPath() string {
 	return "/app/config.json"
 }
 func shouldTryLegacyContainerConfigPath() bool {
 	return strings.TrimSpace(os.Getenv("DS2API_CONFIG_PATH")) == "" && BaseDir() == "/app"
 }
 func RawStreamSampleRoot() string {
 	return ResolvePath("DS2API_RAW_STREAM_SAMPLE_ROOT", "tests/raw_stream_samples")
 }
--- a/internal/config/paths_test.go
+++ b/internal/config/paths_test.go
@@ -0,0 +1,28 @@
 package config
 import (
 	"os"
 	"testing"
 )
 func TestContainerDefaultConfigPath(t *testing.T) {
 	t.Run("fallback to /app when /data is missing", func(t *testing.T) {
 		// This test environment does not guarantee a writable/mounted /data.
 		// If /data is absent we must keep /app fallback to avoid persistence failures.
 		if _, err := os.Stat("/data"); err == nil {
 			t.Skip("/data exists in this environment; cannot validate missing-/data fallback")
 		}
 		if got := containerDefaultConfigPath(); got != "/app/config.json" {
 			t.Fatalf("containerDefaultConfigPath() = %q, want %q", got, "/app/config.json")
 		}
 	})
 	t.Run("prefer /data when /data directory exists", func(t *testing.T) {
 		if _, err := os.Stat("/data"); err != nil {
 			t.Skip("/data does not exist in this environment")
 		}
 		if got := containerDefaultConfigPath(); got != "/data/config.json" {
 			t.Fatalf("containerDefaultConfigPath() = %q, want %q", got, "/data/config.json")
 		}
 	})
 }
--- a/internal/config/store.go
+++ b/internal/config/store.go
@@ -52,11 +52,12 @@ func loadStore() (*Store, error) {
 func loadConfig() (Config, bool, error) {
 	rawCfg := strings.TrimSpace(os.Getenv("DS2API_CONFIG_JSON"))
+	path := ConfigPath()
 	if rawCfg != "" {
 		cfg, err := parseConfigString(rawCfg)
 		if err != nil {
 			if !IsVercel() && envWritebackEnabled() {
-				if fileCfg, fileErr := loadConfigFromFile(ConfigPath()); fileErr == nil {
+				if fileCfg, fileErr := loadConfigFromFile(path); fileErr == nil {
 					return fileCfg, false, nil
 				}
 			}
@@ -67,7 +68,7 @@ func loadConfig() (Config, bool, error) {
 		if IsVercel() || !envWritebackEnabled() {
 			return cfg, true, err
 		}
-		content, fileErr := os.ReadFile(ConfigPath())
+		content, fileErr := os.ReadFile(path)
 		if fileErr == nil {
 			var fileCfg Config
 			if unmarshalErr := json.Unmarshal(content, &fileCfg); unmarshalErr == nil {
@@ -79,7 +80,7 @@ func loadConfig() (Config, bool, error) {
 			if validateErr := ValidateConfig(cfg); validateErr != nil {
 				return cfg, true, validateErr
 			}
-			if writeErr := writeConfigFile(ConfigPath(), cfg.Clone()); writeErr == nil {
+			if writeErr := writeConfigFile(path, cfg.Clone()); writeErr == nil {
 				return cfg, false, err
 			} else {
 				Logger.Warn("[config] env writeback bootstrap failed", "error", writeErr)
@@ -87,14 +88,23 @@ func loadConfig() (Config, bool, error) {
 		}
 		return cfg, true, err
 	}
-
-	cfg, err := loadConfigFromFile(ConfigPath())
+	cfg, err := loadConfigFromFile(path)
 	if err != nil {
+		if shouldTryLegacyContainerConfigPath() {
+			legacyPath := legacyContainerConfigPath()
+			if legacyCfg, legacyErr := loadConfigFromFile(legacyPath); legacyErr == nil {
+				Logger.Info("[config] loaded legacy container config path", "path", legacyPath)
+				return legacyCfg, false, nil
+			}
+		}
 		if IsVercel() {
-			// Vercel one-click deploy may start without a writable/present config file.
-			// Keep an in-memory config so users can bootstrap via WebUI then sync env.
+			// Vercel may start without writable/present config; keep in-memory bootstrap config.
 			return Config{}, true, nil
 		}
+		if shouldBootstrapMissingConfigFile(err) {
+			Logger.Warn("[config] config file missing; starting with empty file-backed config", "path", path)
+			return Config{}, false, nil
+		}
 		return Config{}, false, err
 	}
 	if IsVercel() {
@@ -104,6 +114,10 @@ func loadConfig() (Config, bool, error) {
 	return cfg, false, nil
 }
+func shouldBootstrapMissingConfigFile(err error) bool {
+	return errors.Is(err, os.ErrNotExist) && strings.TrimSpace(os.Getenv("DS2API_CONFIG_PATH")) != ""
+}
 func loadConfigFromFile(path string) (Config, error) {
 	content, err := os.ReadFile(path)
 	if err != nil {
--- a/internal/config/store_accessors.go
+++ b/internal/config/store_accessors.go
@@ -21,24 +21,6 @@ func (s *Store) ModelAliases() map[string]string {
 	return out
 }
-func (s *Store) CompatWideInputStrictOutput() bool {
-	s.mu.RLock()
-	defer s.mu.RUnlock()
-	if s.cfg.Compat.WideInputStrictOutput == nil {
-		return true
-	}
-	return *s.cfg.Compat.WideInputStrictOutput
-}
-func (s *Store) CompatStripReferenceMarkers() bool {
-	s.mu.RLock()
-	defer s.mu.RUnlock()
-	if s.cfg.Compat.StripReferenceMarkers == nil {
-		return true
-	}
-	return *s.cfg.Compat.StripReferenceMarkers
-}
 func (s *Store) ToolcallMode() string {
 	return "feature_match"
 }
@@ -163,14 +145,6 @@ func (s *Store) AutoDeleteSessions() bool {
 	return s.AutoDeleteMode() != "none"
 }
-func (s *Store) HistorySplitEnabled() bool {
-	return false
-}
-func (s *Store) HistorySplitTriggerAfterTurns() int {
-	return 1
-}
 func (s *Store) CurrentInputFileEnabled() bool {
 	s.mu.RLock()
 	defer s.mu.RUnlock()
--- a/internal/config/store_accessors_test.go
+++ b/internal/config/store_accessors_test.go
@@ -2,21 +2,6 @@ package config
 import "testing"
-func TestStoreHistorySplitAccessors(t *testing.T) {
-	enabled := true
-	turns := 3
-	store := &Store{cfg: Config{HistorySplit: HistorySplitConfig{
-		Enabled:           &enabled,
-		TriggerAfterTurns: &turns,
-	}}}
-	if store.HistorySplitEnabled() {
-		t.Fatal("expected history split to stay disabled")
-	}
-	if got := store.HistorySplitTriggerAfterTurns(); got != 1 {
-		t.Fatalf("history split trigger_after_turns=%d want=1", got)
-	}
-}
 func TestStoreCurrentInputFileAccessors(t *testing.T) {
 	store := &Store{cfg: Config{}}
 	if !store.CurrentInputFileEnabled() {
@@ -40,12 +25,6 @@ func TestStoreCurrentInputFileAccessors(t *testing.T) {
 	if got := store.CurrentInputFileMinChars(); got != 12345 {
 		t.Fatalf("current input file min_chars=%d want=12345", got)
 	}
-	historyEnabled := true
-	store.cfg.HistorySplit.Enabled = &historyEnabled
-	if !store.CurrentInputFileEnabled() {
-		t.Fatal("expected history split config to not suppress current input file mode")
-	}
 }
 func TestStoreThinkingInjectionAccessors(t *testing.T) {
--- a/internal/deepseek/client/client_continue.go
+++ b/internal/deepseek/client/client_continue.go
@@ -7,6 +7,7 @@ import (
 	dsprotocol "ds2api/internal/deepseek/protocol"
 	"encoding/json"
 	"errors"
 	"fmt"
 	"io"
 	"net/http"
 	"strings"
@@ -27,7 +28,7 @@ type continueState struct {
 }
 // wrapCompletionWithAutoContinue wraps the completion response body so that
-// if the upstream indicates the response is incomplete (WIP / INCOMPLETE /
+// if the upstream indicates the response is incomplete (INCOMPLETE /
 // AUTO_CONTINUE), ds2api will automatically call the DeepSeek continue
 // endpoint and splice the continuation SSE stream onto the original.
 // The caller sees a single, seamless SSE stream.
@@ -132,33 +133,51 @@ func pumpAutoContinue(ctx context.Context, pw *io.PipeWriter, initial io.ReadClo
 // sentinels are consumed (not forwarded) so that the downstream only sees
 // one final [DONE] at the very end.
 func streamBodyWithContinueState(ctx context.Context, pw *io.PipeWriter, body io.Reader, state *continueState) (bool, error) {
-	scanner := bufio.NewScanner(body)
-	scanner.Buffer(make([]byte, 0, 64*1024), 2*1024*1024)
+	reader := bufio.NewReaderSize(body, 64*1024)
 	hadDone := false
-	for scanner.Scan() {
+	for {
 		select {
 		case <-ctx.Done():
 			return hadDone, ctx.Err()
 		default:
 		}
-		line := append([]byte{}, scanner.Bytes()...)
-		trimmed := strings.TrimSpace(string(line))
-		if trimmed == "" {
-			continue
+		line, err := reader.ReadBytes('\n')
+		if len(line) == 0 && err != nil {
+			if err == io.EOF {
+				return hadDone, nil
+			}
+			return hadDone, err
 		}
-		if strings.HasPrefix(trimmed, "data:") {
-			data := strings.TrimSpace(strings.TrimPrefix(trimmed, "data:"))
-			if data == "[DONE]" {
-				hadDone = true
-				continue
-			}
-			state.observe(data)
-		}
-		if _, err := io.Copy(pw, bytes.NewReader(append(line, '\n'))); err != nil {
-			return hadDone, err
+		trimmed := strings.TrimSpace(string(line))
+		if trimmed != "" {
+			if strings.HasPrefix(trimmed, "data:") {
+				data := strings.TrimSpace(strings.TrimPrefix(trimmed, "data:"))
+				if data == "[DONE]" {
+					hadDone = true
+					if err != nil && err != io.EOF {
+						return hadDone, err
+					}
+					if err == io.EOF {
+						return hadDone, nil
+					}
+					continue
+				}
+				state.observe(data)
+			}
+			if !strings.HasSuffix(string(line), "\n") {
+				line = append(line, '\n')
+			}
+			if _, copyErr := io.Copy(pw, bytes.NewReader(line)); copyErr != nil {
+				return hadDone, copyErr
+			}
+		}
+		if err != nil {
+			if err == io.EOF {
+				return hadDone, nil
+			}
+			return hadDone, err
 		}
 	}
-	return hadDone, scanner.Err()
 }
 // observe extracts continue-relevant signals from an SSE JSON chunk.
@@ -174,49 +193,100 @@ func (s *continueState) observe(data string) {
 	if id := intFrom(chunk["response_message_id"]); id > 0 {
 		s.responseMessageID = id
 	}
-	// Path-based status: {"p": "response/status", "v": "FINISHED"}
-	if p, _ := chunk["p"].(string); p == "response/status" {
-		if status, _ := chunk["v"].(string); status != "" {
-			s.lastStatus = strings.TrimSpace(status)
-			if strings.EqualFold(s.lastStatus, "FINISHED") {
-				s.finished = true
-			}
-		}
+	s.observeDirectPatch(asString(chunk["p"]), chunk["v"])
+	if p, _ := chunk["p"].(string); p == "response" {
+		s.observeBatchPatches("response", chunk["v"])
+	} else {
+		s.observeBatchPatches("", chunk["v"])
 	}
-	// Nested v.response
-	v, _ := chunk["v"].(map[string]any)
-	if response, _ := v["response"].(map[string]any); response != nil {
-		if id := intFrom(response["message_id"]); id > 0 {
-			s.responseMessageID = id
-		}
-		if status, _ := response["status"].(string); status != "" {
-			s.lastStatus = strings.TrimSpace(status)
-			if strings.EqualFold(s.lastStatus, "FINISHED") {
-				s.finished = true
-			}
-		}
-		if autoContinue, ok := response["auto_continue"].(bool); ok && autoContinue {
+	if v, _ := chunk["v"].(map[string]any); v != nil {
+		s.observeResponseObject(v["response"])
+	}
+	if message, _ := chunk["message"].(map[string]any); message != nil {
+		s.observeResponseObject(message["response"])
+	}
+}
+
+func (s *continueState) observeDirectPatch(path string, value any) {
+	if s == nil {
+		return
+	}
+	switch strings.Trim(strings.TrimSpace(path), "/") {
+	case "response/status", "status", "response/quasi_status", "quasi_status":
+		s.setStatus(asString(value))
+	case "response/auto_continue", "auto_continue":
+		if v, ok := value.(bool); ok && v {
 			s.lastStatus = "AUTO_CONTINUE"
 		}
 	}
-	// Nested message.response
-	if message, _ := chunk["message"].(map[string]any); message != nil {
-		if response, _ := message["response"].(map[string]any); response != nil {
-			if id := intFrom(response["message_id"]); id > 0 {
-				s.responseMessageID = id
-			}
-			if status, _ := response["status"].(string); status != "" {
-				s.lastStatus = strings.TrimSpace(status)
-				if strings.EqualFold(s.lastStatus, "FINISHED") {
-					s.finished = true
-				}
+}
+
+func (s *continueState) observeResponseObject(raw any) {
+	if s == nil {
+		return
+	}
+	response, _ := raw.(map[string]any)
+	if response == nil {
+		return
+	}
+	if id := intFrom(response["message_id"]); id > 0 {
+		s.responseMessageID = id
+	}
+	s.setStatus(asString(response["status"]))
+	if autoContinue, ok := response["auto_continue"].(bool); ok && autoContinue {
+		s.lastStatus = "AUTO_CONTINUE"
+	}
+}
+func (s *continueState) observeBatchPatches(parentPath string, raw any) {
+	if s == nil {
+		return
+	}
+	patches, ok := raw.([]any)
+	if !ok {
+		return
+	}
+	for _, patch := range patches {
+		m, ok := patch.(map[string]any)
+		if !ok {
+			continue
+		}
+		path := strings.TrimSpace(asString(m["p"]))
+		if path == "" {
+			continue
+		}
+		fullPath := path
+		if parent := strings.Trim(strings.TrimSpace(parentPath), "/"); parent != "" && !strings.Contains(path, "/") {
+			fullPath = parent + "/" + path
+		}
+		switch strings.Trim(strings.TrimSpace(fullPath), "/") {
+		case "response/status", "status", "response/quasi_status", "quasi_status":
+			s.setStatus(asString(m["v"]))
+		case "response/auto_continue", "auto_continue":
+			if v, ok := m["v"].(bool); ok && v {
+				s.lastStatus = "AUTO_CONTINUE"
 			}
 		}
 	}
 }
-// shouldContinue returns true when the upstream indicates the response is
-// not yet finished and we have enough information to issue a continue request.
+func (s *continueState) setStatus(status string) {
+	if s == nil {
+		return
+	}
+	normalized := strings.TrimSpace(status)
+	if normalized == "" {
+		return
+	}
+	s.lastStatus = normalized
+	if strings.EqualFold(normalized, "FINISHED") || strings.EqualFold(normalized, "CONTENT_FILTER") {
+		s.finished = true
+	}
+}
+// shouldContinue returns true when the upstream explicitly indicates the
+// response is incomplete and we have enough information to issue a continue
+// request. Plain WIP is not sufficient because normal streams begin in WIP.
 func (s *continueState) shouldContinue() bool {
 	if s == nil {
 		return false
@@ -225,7 +295,7 @@ func (s *continueState) shouldContinue() bool {
 		return false
 	}
 	switch strings.ToUpper(strings.TrimSpace(s.lastStatus)) {
-	case "WIP", "INCOMPLETE", "AUTO_CONTINUE":
+	case "INCOMPLETE", "AUTO_CONTINUE":
 		return true
 	default:
 		return false
@@ -241,3 +311,19 @@ func (s *continueState) prepareForNextRound() {
 	s.finished = false
 	s.lastStatus = ""
 }
 func asString(v any) string {
 	if v == nil {
 		return ""
 	}
 	switch x := v.(type) {
 	case string:
 		return x
 	default:
 		s := strings.TrimSpace(strings.ReplaceAll(strings.TrimSpace(fmt.Sprint(v)), "\u0000", ""))
 		if s == "<nil>" {
 			return ""
 		}
 		return s
 	}
 }
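The status handling in this file comes down to one rule: only an explicit `INCOMPLETE` or `AUTO_CONTINUE` signal triggers a continue request, plain `WIP` does not, and a terminal status (`FINISHED`, `CONTENT_FILTER`) stops it. A trimmed-down sketch of that rule follows. `continueTracker` and `decide` are hypothetical names, and the sketch assumes that a terminal status always wins; the message-ID bookkeeping and the JSON patch parsing of the real `continueState` are left out:

```go
package main

import (
	"fmt"
	"strings"
)

// continueTracker is a hypothetical stand-in for continueState that
// tracks only the status signals, not message IDs.
type continueTracker struct {
	lastStatus string
	finished   bool
}

// setStatus records a non-empty status. Empty values are ignored so a
// chunk without a status does not erase an earlier INCOMPLETE.
func (t *continueTracker) setStatus(status string) {
	normalized := strings.TrimSpace(status)
	if normalized == "" {
		return
	}
	t.lastStatus = normalized
	if strings.EqualFold(normalized, "FINISHED") || strings.EqualFold(normalized, "CONTENT_FILTER") {
		t.finished = true
	}
}

// shouldContinue assumes a terminal status wins, then requires an
// explicit continuation signal. Plain WIP is not enough.
func (t *continueTracker) shouldContinue() bool {
	if t.finished {
		return false
	}
	switch strings.ToUpper(t.lastStatus) {
	case "INCOMPLETE", "AUTO_CONTINUE":
		return true
	default:
		return false
	}
}

// decide replays a sequence of observed statuses and returns the verdict.
func decide(statuses ...string) bool {
	t := &continueTracker{}
	for _, s := range statuses {
		t.setStatus(s)
	}
	return t.shouldContinue()
}

func main() {
	fmt.Println(decide("WIP"))                    // normal stream start only
	fmt.Println(decide("WIP", "INCOMPLETE"))      // explicit continuation signal
	fmt.Println(decide("INCOMPLETE", ""))         // later chunk omits status
	fmt.Println(decide("INCOMPLETE", "FINISHED")) // terminal status seen
}
```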
--- a/internal/deepseek/client/client_continue_test.go
+++ b/internal/deepseek/client/client_continue_test.go
@@ -8,6 +8,7 @@ import (
 	"io"
 	"net/http"
 	"strings"
 	"sync/atomic"
 	"testing"
 	"ds2api/internal/auth"
@@ -124,6 +125,146 @@ func TestCallCompletionAutoContinueThreadsPowHeader(t *testing.T) {
 	}
 }
 func TestAutoContinueDoesNotTriggerOnPlainWIPWithoutExplicitContinuationSignal(t *testing.T) {
 	initialBody := strings.Join([]string{
 		`data: {"response_message_id":321,"v":{"response":{"message_id":321,"status":"WIP","auto_continue":false}}}`,
 		`data: [DONE]`,
 	}, "\n") + "\n"
 	var continueCalls atomic.Int32
 	body := newAutoContinueBody(context.Background(), io.NopCloser(strings.NewReader(initialBody)), "session-123", 8, func(context.Context, string, int) (*http.Response, error) {
 		continueCalls.Add(1)
 		return nil, errors.New("continue should not have been called")
 	})
 	defer func() { _ = body.Close() }()
 	out, err := io.ReadAll(body)
 	if err != nil {
 		t.Fatalf("read body failed: %v", err)
 	}
 	if continueCalls.Load() != 0 {
 		t.Fatalf("expected no continue calls, got %d", continueCalls.Load())
 	}
 	if !bytes.Contains(out, []byte(`"status":"WIP"`)) || !bytes.Contains(out, []byte(`data: [DONE]`)) {
 		t.Fatalf("expected original body to pass through unchanged, got=%s", string(out))
 	}
 }
 func TestAutoContinuePassesThroughLongSingleSSELine(t *testing.T) {
 	payload := strings.Repeat("x", 2*1024*1024+4096)
 	initialBody := `data: {"p":"response/content","v":"` + payload + `"}` + "\n" +
 		`data: [DONE]` + "\n"
 	body := newAutoContinueBody(context.Background(), io.NopCloser(strings.NewReader(initialBody)), "session-123", 8, func(context.Context, string, int) (*http.Response, error) {
 		return nil, errors.New("continue should not have been called")
 	})
 	defer func() { _ = body.Close() }()
 	out, err := io.ReadAll(body)
 	if err != nil {
 		t.Fatalf("read body failed: %v", err)
 	}
 	if !bytes.Contains(out, []byte(payload)) {
 		t.Fatalf("expected long SSE payload to pass through, got len=%d want payload len=%d", len(out), len(payload))
 	}
 	if !bytes.Contains(out, []byte(`data: [DONE]`)) {
 		t.Fatalf("expected final DONE sentinel in body, got len=%d", len(out))
 	}
 }
 func TestAutoContinueTriggersOnDirectQuasiStatusIncomplete(t *testing.T) {
 	initialBody := strings.Join([]string{
 		`data: {"response_message_id":321,"p":"response/content","v":"<tool_calls><invoke name=\"write_file\"><parameter name=\"content\"><![CDATA[part-one"}`,
 		`data: {"p":"response/quasi_status","v":"INCOMPLETE"}`,
 		`data: [DONE]`,
 	}, "\n") + "\n"
 	var continueCalls atomic.Int32
 	body := newAutoContinueBody(context.Background(), io.NopCloser(strings.NewReader(initialBody)), "session-123", 8, func(context.Context, string, int) (*http.Response, error) {
 		continueCalls.Add(1)
 		return &http.Response{
 			StatusCode: http.StatusOK,
 			Header:     make(http.Header),
 			Body: io.NopCloser(strings.NewReader(
 				`data: {"response_message_id":322,"p":"response/content","v":"-part-two]]></parameter></invoke></tool_calls>"}` + "\n" +
 					`data: {"p":"response/status","v":"FINISHED"}` + "\n" +
 					`data: [DONE]` + "\n",
 			)),
 		}, nil
 	})
 	defer func() { _ = body.Close() }()
 	out, err := io.ReadAll(body)
 	if err != nil {
 		t.Fatalf("read body failed: %v", err)
 	}
 	if continueCalls.Load() != 1 {
 		t.Fatalf("expected exactly one continue call, got %d", continueCalls.Load())
 	}
 	if !bytes.Contains(out, []byte("part-one")) || !bytes.Contains(out, []byte("-part-two")) {
 		t.Fatalf("expected continued tool content in body, got=%s", string(out))
 	}
 }
 func TestAutoContinueTriggersOnResponseBatchQuasiStatusIncomplete(t *testing.T) {
 	initialBody := strings.Join([]string{
 		`data: {"response_message_id":321,"v":{"response":{"message_id":321,"status":"WIP","auto_continue":false}}}`,
 		`data: {"p":"response","o":"BATCH","v":[{"p":"accumulated_token_usage","v":2413},{"p":"quasi_status","v":"INCOMPLETE"}]}`,
 		`data: [DONE]`,
 	}, "\n") + "\n"
 	var continueCalls atomic.Int32
 	body := newAutoContinueBody(context.Background(), io.NopCloser(strings.NewReader(initialBody)), "session-123", 8, func(context.Context, string, int) (*http.Response, error) {
 		continueCalls.Add(1)
 		return &http.Response{
 			StatusCode: http.StatusOK,
 			Header:     make(http.Header),
 			Body: io.NopCloser(strings.NewReader(
 				`data: {"response_message_id":322,"p":"response/status","v":"FINISHED"}` + "\n" +
 					`data: [DONE]` + "\n",
 			)),
 		}, nil
 	})
 	defer func() { _ = body.Close() }()
 	out, err := io.ReadAll(body)
 	if err != nil {
 		t.Fatalf("read body failed: %v", err)
 	}
 	if continueCalls.Load() != 1 {
 		t.Fatalf("expected exactly one continue call, got %d", continueCalls.Load())
 	}
 	if !bytes.Contains(out, []byte(`"quasi_status","v":"INCOMPLETE"`)) || !bytes.Contains(out, []byte(`"v":"FINISHED"`)) {
 		t.Fatalf("expected continued output to include initial and final rounds, got=%s", string(out))
 	}
 }
 func TestAutoContinueDoesNotTriggerWhenResponseBatchQuasiStatusFinished(t *testing.T) {
 	initialBody := strings.Join([]string{
 		`data: {"response_message_id":321,"v":{"response":{"message_id":321,"status":"WIP","auto_continue":false}}}`,
 		`data: {"p":"response","o":"BATCH","v":[{"p":"accumulated_token_usage","v":2413},{"p":"quasi_status","v":"FINISHED"}]}`,
 		`data: [DONE]`,
 	}, "\n") + "\n"
 	var continueCalls atomic.Int32
 	body := newAutoContinueBody(context.Background(), io.NopCloser(strings.NewReader(initialBody)), "session-123", 8, func(context.Context, string, int) (*http.Response, error) {
 		continueCalls.Add(1)
 		return nil, errors.New("continue should not have been called")
 	})
 	defer func() { _ = body.Close() }()
 	out, err := io.ReadAll(body)
 	if err != nil {
 		t.Fatalf("read body failed: %v", err)
 	}
 	if continueCalls.Load() != 0 {
 		t.Fatalf("expected no continue calls, got %d", continueCalls.Load())
 	}
 	if !bytes.Contains(out, []byte(`"quasi_status","v":"FINISHED"`)) || !bytes.Contains(out, []byte(`data: [DONE]`)) {
 		t.Fatalf("expected original finished body to pass through unchanged, got=%s", string(out))
 	}
 }
 type failingOrCompletionDoer struct {
 	completionResp *http.Response
 }
@@ -134,3 +275,33 @@ func (d failingOrCompletionDoer) Do(req *http.Request) (*http.Response, error) {
 	}
 	return nil, errors.New("forced stream failure")
 }
 func TestAutoContinuePreservesIncompleteStateWhenNextChunkOmitsStatus(t *testing.T) {
 	initialBody := strings.Join([]string{
 		`data: {"response_message_id":321,"v":{"response":{"message_id":321,"status":"INCOMPLETE"}}}`,
 		`data: {"p":"response/content","v":{"text":"continued"}}`,
 		`data: [DONE]`,
 	}, "\n") + "\n"
 	var continueCalls atomic.Int32
 	body := newAutoContinueBody(context.Background(), io.NopCloser(strings.NewReader(initialBody)), "session-123", 8, func(context.Context, string, int) (*http.Response, error) {
 		continueCalls.Add(1)
 		return &http.Response{
 			StatusCode: http.StatusOK,
 			Header:     make(http.Header),
 			Body: io.NopCloser(strings.NewReader(
 				`data: {"response_message_id":322,"p":"response/status","v":"FINISHED"}` + "\n" +
 					`data: [DONE]` + "\n",
 			)),
 		}, nil
 	})
 	defer func() { _ = body.Close() }()
 	_, err := io.ReadAll(body)
 	if err != nil {
 		t.Fatalf("read body failed: %v", err)
 	}
 	if continueCalls.Load() != 1 {
 		t.Fatalf("expected exactly one continue call, got %d", continueCalls.Load())
 	}
 }
--- a/internal/deepseek/client/client_file_status.go
+++ b/internal/deepseek/client/client_file_status.go
@@ -22,6 +22,9 @@ const (
 var fileReadySleep = time.Sleep
+// ErrUploadFileNotFound indicates that DeepSeek returned no matching uploaded file.
+var ErrUploadFileNotFound = errors.New("uploaded file not found")
 func (c *Client) waitForUploadedFile(ctx context.Context, a *auth.RequestAuth, result *UploadFileResult) error {
 	if result == nil || strings.TrimSpace(result.ID) == "" {
 		return nil
@@ -42,7 +45,7 @@ func (c *Client) waitForUploadedFile(ctx context.Context, a *auth.RequestAuth, r
 			return fmt.Errorf("waiting for file %s to become ready: %w", result.ID, err)
 		}
-		fetched, err := c.fetchUploadedFile(pollCtx, a, result.ID)
+		fetched, err := c.FetchUploadedFile(pollCtx, a, result.ID)
 		if err == nil && fetched != nil {
 			mergeUploadFileResults(result, fetched)
 			if isReadyUploadFileStatus(result.Status) {
@@ -65,7 +68,8 @@ func (c *Client) waitForUploadedFile(ctx context.Context, a *auth.RequestAuth, r
 	return fmt.Errorf("file %s did not become ready: %w", result.ID, lastErr)
 }
-func (c *Client) fetchUploadedFile(ctx context.Context, a *auth.RequestAuth, fileID string) (*UploadFileResult, error) {
+// FetchUploadedFile returns metadata for an uploaded DeepSeek file by ID.
+func (c *Client) FetchUploadedFile(ctx context.Context, a *auth.RequestAuth, fileID string) (*UploadFileResult, error) {
 	fileID = strings.TrimSpace(fileID)
 	if fileID == "" {
 		return nil, errors.New("file id is required")
@@ -92,7 +96,7 @@ func (c *Client) fetchUploadedFile(ctx context.Context, a *auth.RequestAuth, fil
 	result := extractFetchedUploadFileResult(resp, fileID)
 	if result == nil || strings.TrimSpace(result.ID) == "" {
-		return nil, errors.New("fetch files succeeded without matching file data")
+		return nil, ErrUploadFileNotFound
 	}
 	result.Raw = resp
 	return result, nil
--- a/internal/deepseek/client/client_upload.go
+++ b/internal/deepseek/client/client_upload.go
@@ -23,6 +23,7 @@ type UploadFileRequest struct {
 	Filename    string
 	ContentType string
 	Purpose     string
 	ModelType   string
 	Data        []byte
 }
@@ -54,6 +55,7 @@ func (c *Client) UploadFile(ctx context.Context, a *auth.RequestAuth, req Upload
 		contentType = "application/octet-stream"
 	}
 	purpose := strings.TrimSpace(req.Purpose)
 	modelType := strings.ToLower(strings.TrimSpace(req.ModelType))
 	body, contentTypeHeader, err := buildUploadMultipartBody(filename, contentType, req.Data)
 	if err != nil {
 		return nil, err
@@ -64,6 +66,9 @@ func (c *Client) UploadFile(ctx context.Context, a *auth.RequestAuth, req Upload
 		"purpose":      purpose,
 		"bytes":        len(req.Data),
 	}
 	if modelType != "" {
 		capturePayload["model_type"] = modelType
 	}
 	captureSession := c.capture.Start("deepseek_upload_file", dsprotocol.DeepSeekUploadFileURL, a.AccountID, capturePayload)
 	attempts := 0
 	refreshed := false
@@ -81,6 +86,9 @@ func (c *Client) UploadFile(ctx context.Context, a *auth.RequestAuth, req Upload
 		}
 		headers := c.authHeaders(a.DeepSeekToken)
 		headers["Content-Type"] = contentTypeHeader
 		if modelType != "" {
 			headers["x-model-type"] = modelType
 		}
 		headers["x-ds-pow-response"] = powHeader
 		headers["x-file-size"] = strconv.Itoa(len(req.Data))
 		headers["x-thinking-enabled"] = "1"
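The upload change normalizes `ModelType` (trim, lowercase) and sends `x-model-type` only when the result is non-empty. A minimal sketch of that optional-header pattern; `buildUploadHeaders` is a hypothetical helper and covers only two of the headers shown in the hunk:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// buildUploadHeaders is a hypothetical helper: x-file-size is always
// set, x-model-type only when the normalized model type is non-empty.
func buildUploadHeaders(modelType string, size int) map[string]string {
	headers := map[string]string{
		"x-file-size": strconv.Itoa(size),
	}
	if mt := strings.ToLower(strings.TrimSpace(modelType)); mt != "" {
		headers["x-model-type"] = mt
	}
	return headers
}

func main() {
	fmt.Println(buildUploadHeaders(" Vision ", 5))
	fmt.Println(buildUploadHeaders("   ", 5))
}
```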
--- a/internal/deepseek/client/client_upload_test.go
+++ b/internal/deepseek/client/client_upload_test.go
@@ -82,6 +82,7 @@ func TestUploadFileUsesUploadTargetPowAndMultipartHeaders(t *testing.T) {
 	var seenTargetPath string
 	var seenContentType string
 	var seenFileSize string
 	var seenModelType string
 	var seenBody string
 	call := 0
 	client := &Client{
@@ -96,6 +97,7 @@ func TestUploadFileUsesUploadTargetPowAndMultipartHeaders(t *testing.T) {
 				seenPow = req.Header.Get("x-ds-pow-response")
 				seenContentType = req.Header.Get("Content-Type")
 				seenFileSize = req.Header.Get("x-file-size")
 				seenModelType = req.Header.Get("x-model-type")
 				seenBody = string(bodyBytes)
 				return &http.Response{StatusCode: http.StatusOK, Header: make(http.Header), Body: io.NopCloser(strings.NewReader(uploadResponse)), Request: req}, nil
 			default:
@@ -112,6 +114,7 @@ func TestUploadFileUsesUploadTargetPowAndMultipartHeaders(t *testing.T) {
 		Filename:    "demo.txt",
 		ContentType: "text/plain",
 		Purpose:     "assistants",
 		ModelType:   "vision",
 		Data:        []byte("hello"),
 	}, 1)
 	if err != nil {
@@ -140,6 +143,9 @@ func TestUploadFileUsesUploadTargetPowAndMultipartHeaders(t *testing.T) {
 	if seenFileSize != "5" {
 		t.Fatalf("expected x-file-size=5, got %q", seenFileSize)
 	}
 	if seenModelType != "vision" {
 		t.Fatalf("expected x-model-type=vision, got %q", seenModelType)
 	}
 	if !strings.HasPrefix(seenContentType, "multipart/form-data; boundary=") {
 		t.Fatalf("expected multipart content type, got %q", seenContentType)
 	}
--- a/internal/deepseek/protocol/constants.go
+++ b/internal/deepseek/protocol/constants.go
@@ -159,6 +159,6 @@ func toStringSet(in []string) map[string]struct{} {
 const (
 	KeepAliveTimeout  = 5
-	StreamIdleTimeout = 90
+	StreamIdleTimeout = 300
-	MaxKeepaliveCount = 10
+	MaxKeepaliveCount = 40
 )
--- a/internal/deepseek/protocol/constants_shared.json
+++ b/internal/deepseek/protocol/constants_shared.json
@@ -2,7 +2,7 @@
  "client": {
    "name": "DeepSeek",
    "platform": "android",
-    "version": "2.0.1",
+    "version": "2.0.4",
    "android_api_level": "35",
    "locale": "zh_CN"
  },
@@ -24,4 +24,4 @@
  "skip_exact_paths": [
    "response/search_status"
  ]
-}
+}
--- a/internal/deepseek/protocol/sse.go
+++ b/internal/deepseek/protocol/sse.go
@@ -2,20 +2,24 @@ package protocol
 import (
 	"bufio"
+	"io"
 	"net/http"
 )
 func ScanSSELines(resp *http.Response, onLine func([]byte) bool) error {
-	scanner := bufio.NewScanner(resp.Body)
-	buf := make([]byte, 0, 64*1024)
-	scanner.Buffer(buf, 2*1024*1024)
-	for scanner.Scan() {
-		if !onLine(scanner.Bytes()) {
-			break
-		}
-	}
-	if err := scanner.Err(); err != nil {
-		return err
-	}
-	return nil
+	reader := bufio.NewReaderSize(resp.Body, 64*1024)
+	for {
+		line, err := reader.ReadBytes('\n')
+		if len(line) > 0 {
+			if !onLine(line) {
+				return nil
+			}
+		}
+		if err != nil {
+			if err == io.EOF {
+				return nil
+			}
+			return err
+		}
+	}
 }
--- a/internal/deepseek/protocol/sse_test.go
+++ b/internal/deepseek/protocol/sse_test.go
@@ -0,0 +1,26 @@
 package protocol
 import (
 	"io"
 	"net/http"
 	"strings"
 	"testing"
 )
 func TestScanSSELinesHandlesLongSingleLine(t *testing.T) {
 	payload := strings.Repeat("x", 2*1024*1024+4096)
 	body := "data: {\"p\":\"response/content\",\"v\":\"" + payload + "\"}\n"
 	resp := &http.Response{Body: io.NopCloser(strings.NewReader(body))}
 	var got string
 	err := ScanSSELines(resp, func(line []byte) bool {
 		got = string(line)
 		return true
 	})
 	if err != nil {
 		t.Fatalf("ScanSSELines returned error: %v", err)
 	}
 	if !strings.Contains(got, payload) {
 		t.Fatalf("long SSE line was not preserved: got len=%d want payload len=%d", len(got), len(payload))
 	}
 }
--- a/internal/devcapture/store.go
+++ b/internal/devcapture/store.go
@@ -10,6 +10,8 @@ import (
 	"sync"
 	"time"
+	"ds2api/internal/util"
 	"github.com/google/uuid"
 )
@@ -194,7 +196,8 @@ func (c *captureBody) append(chunk string) {
 	}
 	remain := maxLen - current
 	if len(chunk) > remain {
-		c.buf.WriteString(chunk[:remain])
+		truncated, _ := util.TruncateUTF8Bytes(chunk, remain)
+		c.buf.WriteString(truncated)
 		c.truncated = true
 		return
 	}
--- a/internal/devcapture/store_test.go
+++ b/internal/devcapture/store_test.go
@@ -4,6 +4,7 @@ import (
 	"io"
 	"strings"
 	"testing"
 	"unicode/utf8"
 )
 func TestNewFromEnvDefaults(t *testing.T) {
@@ -82,3 +83,28 @@ func TestWrapBodyTruncatesByLimit(t *testing.T) {
 		t.Fatalf("expected account id, got %q", items[0].AccountID)
 	}
 }
 func TestWrapBodyTruncatesUTF8WithoutBreakingRune(t *testing.T) {
 	s := &Store{enabled: true, limit: 5, maxBodyBytes: 5}
 	session := s.Start("test", "http://x", "acc1", map[string]any{"x": 1})
 	if session == nil {
 		t.Fatal("expected session")
 	}
 	rc := session.WrapBody(io.NopCloser(strings.NewReader("😀xy")), 200)
 	_, _ = io.ReadAll(rc)
 	_ = rc.Close()
 	items := s.Snapshot()
 	if len(items) != 1 {
 		t.Fatalf("expected 1 item, got %d", len(items))
 	}
 	if !utf8.ValidString(items[0].ResponseBody) {
 		t.Fatalf("expected valid utf-8 response body, got %q", items[0].ResponseBody)
 	}
 	if items[0].ResponseBody != "😀x" {
 		t.Fatalf("expected rune-safe truncation, got %q", items[0].ResponseBody)
 	}
 	if !items[0].ResponseTruncated {
 		t.Fatal("expected truncated flag true")
 	}
 }
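Reviewer note: the implementation of `util.TruncateUTF8Bytes` is not part of this excerpt. The sketch below is one plausible way to get the behaviour the test above expects, under the assumption that the input is valid UTF-8; `truncateUTF8Bytes` and its boolean result are hypothetical and may differ from the real helper.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateUTF8Bytes returns the longest prefix of s that is at most limit
// bytes long and does not end in the middle of a multi-byte rune.
// The boolean reports whether anything was cut off.
// Hypothetical helper: the real util.TruncateUTF8Bytes may differ.
func truncateUTF8Bytes(s string, limit int) (string, bool) {
	if limit < 0 {
		limit = 0
	}
	if len(s) <= limit {
		return s, false
	}
	cut := limit
	// Step back while the byte at the cut point is a continuation byte,
	// so the prefix ends on a rune boundary.
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut], true
}

func main() {
	out, truncated := truncateUTF8Bytes("😀xy", 5)
	fmt.Printf("%q truncated=%v valid=%v\n", out, truncated, utf8.ValidString(out))
}
```

A plain `chunk[:remain]` slice, as in the removed line, can split a four-byte emoji and leave invalid UTF-8 in the captured body, which is the case the new test guards against.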
--- a/internal/format/claude/render.go
+++ b/internal/format/claude/render.go
@@ -1,13 +1,56 @@
 package claude
 import (
 	"ds2api/internal/assistantturn"
 	"ds2api/internal/toolcall"
 	"fmt"
 	"time"
 	"ds2api/internal/prompt"
 	"ds2api/internal/util"
 )
 func BuildMessageResponseFromTurn(messageID, model string, turn assistantturn.Turn, exposeThinking bool) map[string]any {
 	content := make([]map[string]any, 0, 4)
 	if exposeThinking && turn.Thinking != "" {
 		content = append(content, map[string]any{"type": "thinking", "thinking": turn.Thinking})
 	}
 	stopReason := "end_turn"
 	if len(turn.ToolCalls) > 0 {
 		stopReason = "tool_use"
 		for i, tc := range turn.ToolCalls {
 			content = append(content, map[string]any{
 				"type":  "tool_use",
 				"id":    fmt.Sprintf("toolu_%d_%d", time.Now().Unix(), i),
 				"name":  tc.Name,
 				"input": tc.Input,
 			})
 		}
 	} else {
 		text := turn.Text
 		if text == "" && exposeThinking {
 			text = turn.Thinking
 		}
 		if text == "" {
 			text = "抱歉，没有生成有效的响应内容。"
 		}
 		content = append(content, map[string]any{"type": "text", "text": text})
 	}
 	return map[string]any{
 		"id":            messageID,
 		"type":          "message",
 		"role":          "assistant",
 		"model":         model,
 		"content":       content,
 		"stop_reason":   stopReason,
 		"stop_sequence": nil,
 		"usage": map[string]any{
 			"input_tokens":  turn.Usage.InputTokens,
 			"output_tokens": turn.Usage.OutputTokens,
 		},
 	}
 }
 func BuildMessageResponse(messageID, model string, normalizedMessages []any, finalThinking, finalText string, toolNames []string) map[string]any {
 	detected := toolcall.ParseToolCalls(finalText, toolNames)
 	if len(detected) == 0 && finalText == "" && finalThinking != "" {
@@ -43,8 +86,23 @@ func BuildMessageResponse(messageID, model string, normalizedMessages []any, fin
 		"stop_reason":   stopReason,
 		"stop_sequence": nil,
 		"usage": map[string]any{
-			"input_tokens":  util.EstimateTokens(fmt.Sprintf("%v", normalizedMessages)),
+			"input_tokens":  util.CountPromptTokens(prompt.MessagesPrepareWithThinking(claudeMessageMaps(normalizedMessages), false), model),
-			"output_tokens": util.EstimateTokens(finalThinking) + util.EstimateTokens(finalText),
+			"output_tokens": util.CountOutputTokens(finalThinking, model) + util.CountOutputTokens(finalText, model),
 		},
 	}
 }
 func claudeMessageMaps(messages []any) []map[string]any {
 	if len(messages) == 0 {
 		return nil
 	}
 	out := make([]map[string]any, 0, len(messages))
 	for _, item := range messages {
 		msg, ok := item.(map[string]any)
 		if !ok {
 			continue
 		}
 		out = append(out, msg)
 	}
 	return out
 }
--- a/internal/format/openai/render_chat.go
+++ b/internal/format/openai/render_chat.go
@@ -29,7 +29,7 @@ func BuildChatCompletionWithToolCalls(completionID, model, finalPrompt, finalThi
 		"created": time.Now().Unix(),
 		"model":   model,
 		"choices": []map[string]any{{"index": 0, "message": messageObj, "finish_reason": finishReason}},
-		"usage":   BuildChatUsage(finalPrompt, finalThinking, finalText),
+		"usage":   BuildChatUsageForModel(model, finalPrompt, finalThinking, finalText, 0),
 	}
 }
--- a/internal/format/openai/render_responses.go
+++ b/internal/format/openai/render_responses.go
@@ -70,7 +70,7 @@ func BuildResponseObjectFromItems(responseID, model, finalPrompt, finalThinking,
 		"model":       model,
 		"output":      output,
 		"output_text": outputText,
-		"usage":       BuildResponsesUsage(finalPrompt, finalThinking, finalText),
+		"usage":       BuildResponsesUsageForModel(model, finalPrompt, finalThinking, finalText, 0),
 	}
 }
--- a/internal/format/openai/render_test.go
+++ b/internal/format/openai/render_test.go
@@ -6,6 +6,7 @@ import (
 	"testing"
 	"ds2api/internal/toolcall"
 	"ds2api/internal/util"
 )
 func TestBuildResponseObjectKeepsFencedToolPayloadAsText(t *testing.T) {
@@ -177,3 +178,17 @@ func TestBuildResponseObjectWithToolCallsCoercesSchemaDeclaredStringArguments(t
 		t.Fatalf("expected response content stringified by schema, got %#v", args["content"])
 	}
 }
 func TestBuildChatUsageForModelUsesConservativePromptCount(t *testing.T) {
 	prompt := strings.Repeat("上下文token ", 40)
 	usage := BuildChatUsageForModel("deepseek-v4-flash", prompt, "", "ok", 0)
 	promptTokens, _ := usage["prompt_tokens"].(int)
 	if promptTokens <= util.EstimateTokens(prompt) {
 		t.Fatalf("expected conservative prompt token count > rough estimate, got=%d estimate=%d", promptTokens, util.EstimateTokens(prompt))
 	}
 	totalTokens, _ := usage["total_tokens"].(int)
 	completionTokens, _ := usage["completion_tokens"].(int)
 	if totalTokens != promptTokens+completionTokens {
 		t.Fatalf("expected total tokens to add up, got usage=%#v", usage)
 	}
 }
--- a/internal/format/openai/render_usage.go
+++ b/internal/format/openai/render_usage.go
@@ -2,10 +2,10 @@ package openai
 import "ds2api/internal/util"
-func BuildChatUsage(finalPrompt, finalThinking, finalText string) map[string]any {
+func BuildChatUsageForModel(model, finalPrompt, finalThinking, finalText string, refFileTokens int) map[string]any {
-	promptTokens := util.EstimateTokens(finalPrompt)
+	promptTokens := util.CountPromptTokens(finalPrompt, model) + refFileTokens
-	reasoningTokens := util.EstimateTokens(finalThinking)
+	reasoningTokens := util.CountOutputTokens(finalThinking, model)
-	completionTokens := util.EstimateTokens(finalText)
+	completionTokens := util.CountOutputTokens(finalText, model)
 	return map[string]any{
 		"prompt_tokens":     promptTokens,
 		"completion_tokens": reasoningTokens + completionTokens,
@@ -16,13 +16,21 @@ func BuildChatUsage(finalPrompt, finalThinking, finalText string) map[string]any
 	}
 }
-func BuildResponsesUsage(finalPrompt, finalThinking, finalText string) map[string]any {
-	promptTokens := util.EstimateTokens(finalPrompt)
-	reasoningTokens := util.EstimateTokens(finalThinking)
-	completionTokens := util.EstimateTokens(finalText)
+func BuildChatUsage(finalPrompt, finalThinking, finalText string) map[string]any {
+	return BuildChatUsageForModel("", finalPrompt, finalThinking, finalText, 0)
+}
+
+func BuildResponsesUsageForModel(model, finalPrompt, finalThinking, finalText string, refFileTokens int) map[string]any {
+	promptTokens := util.CountPromptTokens(finalPrompt, model) + refFileTokens
+	reasoningTokens := util.CountOutputTokens(finalThinking, model)
+	completionTokens := util.CountOutputTokens(finalText, model)
 	return map[string]any{
 		"input_tokens":  promptTokens,
 		"output_tokens": reasoningTokens + completionTokens,
 		"total_tokens":  promptTokens + reasoningTokens + completionTokens,
 	}
 }
+
+func BuildResponsesUsage(finalPrompt, finalThinking, finalText string) map[string]any {
+	return BuildResponsesUsageForModel("", finalPrompt, finalThinking, finalText, 0)
+}
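Reviewer note: the usage arithmetic in `render_usage.go` can be summarised with a small stand-alone sketch. The counter below is a placeholder (roughly one token per four runes, rounded up) and is not the behaviour of `util.CountPromptTokens` or `util.CountOutputTokens`, whose implementations are not shown here; `usage` and `buildUsage` are hypothetical names.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countTokens is a placeholder estimator used only for this sketch:
// about one token per four runes, rounded up.
func countTokens(s string) int {
	return (utf8.RuneCountInString(s) + 3) / 4
}

// usage holds the three totals that the builders in the diff report.
type usage struct {
	PromptTokens     int
	CompletionTokens int
	TotalTokens      int
}

// buildUsage follows the shape of the diff: tokens from referenced files
// are added to the prompt side, reasoning and visible text are both
// counted as output, and the total is the sum of the two sides.
func buildUsage(prompt, thinking, text string, refFileTokens int) usage {
	promptTokens := countTokens(prompt) + refFileTokens
	outputTokens := countTokens(thinking) + countTokens(text)
	return usage{
		PromptTokens:     promptTokens,
		CompletionTokens: outputTokens,
		TotalTokens:      promptTokens + outputTokens,
	}
}

func main() {
	u := buildUsage("abcdefgh", "abcd", "ab", 10)
	fmt.Printf("%+v\n", u)
}
```

Keeping the old `BuildChatUsage` and `BuildResponsesUsage` as thin wrappers that pass an empty model and zero reference-file tokens lets existing callers compile unchanged.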
--- a/internal/httpapi/admin/accounts/handler_accounts_testing.go
+++ b/internal/httpapi/admin/accounts/handler_accounts_testing.go
@@ -107,6 +107,7 @@ func (h *Handler) testAccount(ctx context.Context, acc config.Account, model, me
 		"model":           model,
 		"session_count":   0,
 		"config_writable": !h.Store.IsEnvBacked(),
 		"config_warning":  "",
 	}
 	defer func() {
 		status := "failed"
@@ -121,8 +122,7 @@ func (h *Handler) testAccount(ctx context.Context, acc config.Account, model, me
 		return result
 	}
 	if err := h.Store.UpdateAccountToken(acc.Identifier(), token); err != nil {
-		result["message"] = "登录成功但写入运行时 token 失败: " + err.Error()
+		result["config_warning"] = "登录成功，但 token 持久化失败（仅保存在内存，重启后会丢失）: " + err.Error()
-		return result
 	}
 	authCtx := &authn.RequestAuth{UseConfigToken: false, DeepSeekToken: token, AccountID: identifier, Account: acc}
 	proxyCtx := authn.WithAuth(ctx, authCtx)
@@ -136,8 +136,7 @@ func (h *Handler) testAccount(ctx context.Context, acc config.Account, model, me
 		token = newToken
 		authCtx.DeepSeekToken = token
 		if err := h.Store.UpdateAccountToken(acc.Identifier(), token); err != nil {
-			result["message"] = "刷新 token 成功但写入运行时 token 失败: " + err.Error()
+			result["config_warning"] = "刷新 token 成功，但 token 持久化失败（仅保存在内存，重启后会丢失）: " + err.Error()
-			return result
 		}
 		sessionID, err = h.DS.CreateSession(proxyCtx, authCtx, 1)
 		if err != nil {
@@ -155,6 +154,9 @@ func (h *Handler) testAccount(ctx context.Context, acc config.Account, model, me
 	if strings.TrimSpace(message) == "" {
 		result["success"] = true
 		result["message"] = "Token 刷新成功（登录与会话创建成功）"
 		if warning, _ := result["config_warning"].(string); strings.TrimSpace(warning) != "" {
 			result["message"] = result["message"].(string) + "；" + warning
 		}
 		result["response_time"] = int(time.Since(start).Milliseconds())
 		return result
 	}
--- a/internal/httpapi/admin/handler_settings_test.go
+++ b/internal/httpapi/admin/handler_settings_test.go
@@ -208,9 +208,6 @@ func TestUpdateSettingsCurrentInputFile(t *testing.T) {
 	if !h.Store.CurrentInputFileEnabled() {
 		t.Fatal("expected current input file accessor to stay enabled")
 	}
-	if h.Store.HistorySplitEnabled() {
-		t.Fatal("expected history split accessor to stay disabled")
-	}
 }
 func TestUpdateSettingsCurrentInputFilePartialUpdatePreservesEnabled(t *testing.T) {
--- a/internal/httpapi/admin/rawsamples/handler_raw_samples.go
+++ b/internal/httpapi/admin/rawsamples/handler_raw_samples.go
@@ -15,6 +15,7 @@ import (
 	"ds2api/internal/devcapture"
 	adminshared "ds2api/internal/httpapi/admin/shared"
 	"ds2api/internal/rawsample"
 	"ds2api/internal/util"
 )
 type captureChain struct {
@@ -479,10 +480,13 @@ func previewCaptureChainResponse(chain captureChain) string {
 func previewText(text string, limit int) string {
 	text = strings.TrimSpace(text)
-	if limit <= 0 || len(text) <= limit {
+	if limit <= 0 {
 		return text
 	}
-	return text[:limit] + "..."
+	if truncated, ok := util.TruncateRunes(text, limit); ok {
+		return truncated + "..."
+	}
+	return text
 }
 func captureChainHasTruncatedResponse(chain captureChain) bool {
--- a/internal/httpapi/admin/rawsamples/handler_raw_samples_test.go
+++ b/internal/httpapi/admin/rawsamples/handler_raw_samples_test.go
@@ -10,6 +10,7 @@ import (
 	"path/filepath"
 	"strings"
 	"testing"
 	"unicode/utf8"
 	"ds2api/internal/devcapture"
 )
@@ -231,6 +232,16 @@ func TestCombineCaptureBodiesPreservesOrderAndSeparators(t *testing.T) {
 	}
 }
 func TestPreviewTextPreservesUTF8MB4Characters(t *testing.T) {
 	preview := previewText(strings.Repeat("😀", 281), 280)
 	if !utf8.ValidString(preview) {
 		t.Fatalf("expected valid utf-8 preview, got %q", preview)
 	}
 	if preview != strings.Repeat("😀", 280)+"..." {
 		t.Fatalf("unexpected preview: %q", preview)
 	}
 }
 func TestQueryRawSampleCapturesGroupsBySessionAndMatchesQuestion(t *testing.T) {
 	devcapture.Global().Clear()
 	defer devcapture.Global().Clear()
--- a/internal/httpapi/admin/settings/handler_settings_parse.go
+++ b/internal/httpapi/admin/settings/handler_settings_parse.go
@@ -21,11 +21,10 @@ func boolFrom(v any) bool {
 	}
 }
-func parseSettingsUpdateRequest(req map[string]any) (*config.AdminConfig, *config.RuntimeConfig, *config.CompatConfig, *config.ResponsesConfig, *config.EmbeddingsConfig, *config.AutoDeleteConfig, *config.CurrentInputFileConfig, *config.ThinkingInjectionConfig, map[string]string, error) {
+func parseSettingsUpdateRequest(req map[string]any) (*config.AdminConfig, *config.RuntimeConfig, *config.ResponsesConfig, *config.EmbeddingsConfig, *config.AutoDeleteConfig, *config.CurrentInputFileConfig, *config.ThinkingInjectionConfig, map[string]string, error) {
 	var (
 		adminCfg        *config.AdminConfig
 		runtimeCfg      *config.RuntimeConfig
-		compatCfg       *config.CompatConfig
 		respCfg         *config.ResponsesConfig
 		embCfg          *config.EmbeddingsConfig
 		autoDeleteCfg   *config.AutoDeleteConfig
@@ -39,7 +38,7 @@ func parseSettingsUpdateRequest(req map[string]any) (*config.AdminConfig, *confi
 		if v, exists := raw["jwt_expire_hours"]; exists {
 			n := intFrom(v)
 			if err := config.ValidateIntRange("admin.jwt_expire_hours", n, 1, 720, true); err != nil {
-				return nil, nil, nil, nil, nil, nil, nil, nil, nil, err
+				return nil, nil, nil, nil, nil, nil, nil, nil, err
 			}
 			cfg.JWTExpireHours = n
 		}
@@ -51,56 +50,43 @@ func parseSettingsUpdateRequest(req map[string]any) (*config.AdminConfig, *confi
 		if v, exists := raw["account_max_inflight"]; exists {
 			n := intFrom(v)
 			if err := config.ValidateIntRange("runtime.account_max_inflight", n, 1, 256, true); err != nil {
-				return nil, nil, nil, nil, nil, nil, nil, nil, nil, err
+				return nil, nil, nil, nil, nil, nil, nil, nil, err
 			}
 			cfg.AccountMaxInflight = n
 		}
 		if v, exists := raw["account_max_queue"]; exists {
 			n := intFrom(v)
 			if err := config.ValidateIntRange("runtime.account_max_queue", n, 1, 200000, true); err != nil {
-				return nil, nil, nil, nil, nil, nil, nil, nil, nil, err
+				return nil, nil, nil, nil, nil, nil, nil, nil, err
 			}
 			cfg.AccountMaxQueue = n
 		}
 		if v, exists := raw["global_max_inflight"]; exists {
 			n := intFrom(v)
 			if err := config.ValidateIntRange("runtime.global_max_inflight", n, 1, 200000, true); err != nil {
-				return nil, nil, nil, nil, nil, nil, nil, nil, nil, err
+				return nil, nil, nil, nil, nil, nil, nil, nil, err
 			}
 			cfg.GlobalMaxInflight = n
 		}
 		if v, exists := raw["token_refresh_interval_hours"]; exists {
 			n := intFrom(v)
 			if err := config.ValidateIntRange("runtime.token_refresh_interval_hours", n, 1, 720, true); err != nil {
-				return nil, nil, nil, nil, nil, nil, nil, nil, nil, err
+				return nil, nil, nil, nil, nil, nil, nil, nil, err
 			}
 			cfg.TokenRefreshIntervalHours = n
 		}
 		if cfg.AccountMaxInflight > 0 && cfg.GlobalMaxInflight > 0 && cfg.GlobalMaxInflight < cfg.AccountMaxInflight {
-			return nil, nil, nil, nil, nil, nil, nil, nil, nil, fmt.Errorf("runtime.global_max_inflight must be >= runtime.account_max_inflight")
+			return nil, nil, nil, nil, nil, nil, nil, nil, fmt.Errorf("runtime.global_max_inflight must be >= runtime.account_max_inflight")
 		}
 		runtimeCfg = cfg
 	}
-	if raw, ok := req["compat"].(map[string]any); ok {
-		cfg := &config.CompatConfig{}
-		if v, exists := raw["wide_input_strict_output"]; exists {
-			b := boolFrom(v)
-			cfg.WideInputStrictOutput = &b
-		}
-		if v, exists := raw["strip_reference_markers"]; exists {
-			b := boolFrom(v)
-			cfg.StripReferenceMarkers = &b
-		}
-		compatCfg = cfg
-	}
 	if raw, ok := req["responses"].(map[string]any); ok {
 		cfg := &config.ResponsesConfig{}
 		if v, exists := raw["store_ttl_seconds"]; exists {
 			n := intFrom(v)
 			if err := config.ValidateIntRange("responses.store_ttl_seconds", n, 30, 86400, true); err != nil {
-				return nil, nil, nil, nil, nil, nil, nil, nil, nil, err
+				return nil, nil, nil, nil, nil, nil, nil, nil, err
 			}
 			cfg.StoreTTLSeconds = n
 		}
@@ -112,7 +98,7 @@ func parseSettingsUpdateRequest(req map[string]any) (*config.AdminConfig, *confi
 		if v, exists := raw["provider"]; exists {
 			p := strings.TrimSpace(fmt.Sprintf("%v", v))
 			if err := config.ValidateTrimmedString("embeddings.provider", p, false); err != nil {
-				return nil, nil, nil, nil, nil, nil, nil, nil, nil, err
+				return nil, nil, nil, nil, nil, nil, nil, nil, err
 			}
 			cfg.Provider = p
 		}
@@ -138,7 +124,7 @@ func parseSettingsUpdateRequest(req map[string]any) (*config.AdminConfig, *confi
 		if v, exists := raw["mode"]; exists {
 			mode := strings.ToLower(strings.TrimSpace(fmt.Sprintf("%v", v)))
 			if err := config.ValidateAutoDeleteMode(mode); err != nil {
-				return nil, nil, nil, nil, nil, nil, nil, nil, nil, err
+				return nil, nil, nil, nil, nil, nil, nil, nil, err
 			}
 			if mode == "" {
 				mode = "none"
@@ -160,12 +146,12 @@ func parseSettingsUpdateRequest(req map[string]any) (*config.AdminConfig, *confi
 		if v, exists := raw["min_chars"]; exists {
 			n := intFrom(v)
 			if err := config.ValidateIntRange("current_input_file.min_chars", n, 0, 100000000, true); err != nil {
-				return nil, nil, nil, nil, nil, nil, nil, nil, nil, err
+				return nil, nil, nil, nil, nil, nil, nil, nil, err
 			}
 			cfg.MinChars = n
 		}
 		if err := config.ValidateCurrentInputFileConfig(*cfg); err != nil {
-			return nil, nil, nil, nil, nil, nil, nil, nil, nil, err
+			return nil, nil, nil, nil, nil, nil, nil, nil, err
 		}
 		currentInputCfg = cfg
 	}
@@ -182,5 +168,5 @@ func parseSettingsUpdateRequest(req map[string]any) (*config.AdminConfig, *confi
 		thinkingInjCfg = cfg
 	}
-	return adminCfg, runtimeCfg, compatCfg, respCfg, embCfg, autoDeleteCfg, currentInputCfg, thinkingInjCfg, aliasMap, nil
+	return adminCfg, runtimeCfg, respCfg, embCfg, autoDeleteCfg, currentInputCfg, thinkingInjCfg, aliasMap, nil
 }
--- a/internal/httpapi/admin/settings/handler_settings_read.go
+++ b/internal/httpapi/admin/settings/handler_settings_read.go
@@ -27,7 +27,6 @@ func (h *Handler) getSettings(w http.ResponseWriter, _ *http.Request) {
 			"global_max_inflight":          h.Store.RuntimeGlobalMaxInflight(recommended),
 			"token_refresh_interval_hours": h.Store.RuntimeTokenRefreshIntervalHours(),
 		},
-		"compat":      snap.Compat,
 		"responses":   snap.Responses,
 		"embeddings":  snap.Embeddings,
 		"auto_delete": snap.AutoDelete,
--- a/internal/httpapi/admin/settings/handler_settings_write.go
+++ b/internal/httpapi/admin/settings/handler_settings_write.go
@@ -17,7 +17,7 @@ func (h *Handler) updateSettings(w http.ResponseWriter, r *http.Request) {
 		return
 	}
-	adminCfg, runtimeCfg, compatCfg, responsesCfg, embeddingsCfg, autoDeleteCfg, currentInputCfg, thinkingInjCfg, aliasMap, err := parseSettingsUpdateRequest(req)
+	adminCfg, runtimeCfg, responsesCfg, embeddingsCfg, autoDeleteCfg, currentInputCfg, thinkingInjCfg, aliasMap, err := parseSettingsUpdateRequest(req)
 	if err != nil {
 		writeJSON(w, http.StatusBadRequest, map[string]any{"detail": err.Error()})
 		return
@@ -53,14 +53,6 @@ func (h *Handler) updateSettings(w http.ResponseWriter, r *http.Request) {
 				c.Runtime.TokenRefreshIntervalHours = runtimeCfg.TokenRefreshIntervalHours
 			}
 		}
-		if compatCfg != nil {
-			if compatCfg.WideInputStrictOutput != nil {
-				c.Compat.WideInputStrictOutput = compatCfg.WideInputStrictOutput
-			}
-			if compatCfg.StripReferenceMarkers != nil {
-				c.Compat.StripReferenceMarkers = compatCfg.StripReferenceMarkers
-			}
-		}
 		if responsesCfg != nil && responsesCfg.StoreTTLSeconds > 0 {
 			c.Responses.StoreTTLSeconds = responsesCfg.StoreTTLSeconds
 		}
--- a/internal/httpapi/admin/shared/deps.go
+++ b/internal/httpapi/admin/shared/deps.go
@@ -33,13 +33,10 @@ type ConfigStore interface {
 	RuntimeGlobalMaxInflight(defaultSize int) int
 	RuntimeTokenRefreshIntervalHours() int
 	AutoDeleteMode() string
-	HistorySplitEnabled() bool
-	HistorySplitTriggerAfterTurns() int
 	CurrentInputFileEnabled() bool
 	CurrentInputFileMinChars() int
 	ThinkingInjectionEnabled() bool
 	ThinkingInjectionPrompt() string
-	CompatStripReferenceMarkers() bool
 	AutoDeleteSessions() bool
 }
--- a/internal/httpapi/claude/current_input_file_test.go
+++ b/internal/httpapi/claude/current_input_file_test.go
@@ -0,0 +1,158 @@
 package claude
 import (
 	"context"
 	"io"
 	"net/http"
 	"net/http/httptest"
 	"path/filepath"
 	"strings"
 	"testing"
 	"ds2api/internal/auth"
 	"ds2api/internal/chathistory"
 	dsclient "ds2api/internal/deepseek/client"
 )
 type claudeCurrentInputAuth struct{}
 type claudeHistoryConfig struct {
 	aliases map[string]string
 }
 func (m claudeHistoryConfig) ModelAliases() map[string]string { return m.aliases }
 func (claudeHistoryConfig) CurrentInputFileEnabled() bool     { return false }
 func (claudeHistoryConfig) CurrentInputFileMinChars() int     { return 0 }
 func (claudeCurrentInputAuth) Determine(*http.Request) (*auth.RequestAuth, error) {
 	return &auth.RequestAuth{
 		DeepSeekToken: "direct-token",
 		CallerID:      "caller:test",
 		TriedAccounts: map[string]bool{},
 	}, nil
 }
 func TestClaudeDirectRecordsResponseHistory(t *testing.T) {
 	ds := &claudeCurrentInputDS{}
 	historyStore := chathistory.New(filepath.Join(t.TempDir(), "history.json"))
 	h := &Handler{
 		Store:       claudeHistoryConfig{aliases: map[string]string{"claude-sonnet-4-6": "deepseek-v4-flash"}},
 		Auth:        claudeCurrentInputAuth{},
 		DS:          ds,
 		ChatHistory: historyStore,
 	}
 	reqBody := `{"model":"claude-sonnet-4-6","messages":[{"role":"user","content":"hello from claude"}],"max_tokens":1024}`
 	req := httptest.NewRequest(http.MethodPost, "/v1/messages", strings.NewReader(reqBody))
 	req.Header.Set("Content-Type", "application/json")
 	rec := httptest.NewRecorder()
 	h.Messages(rec, req)
 	if rec.Code != http.StatusOK {
 		t.Fatalf("expected 200, got %d body=%s", rec.Code, rec.Body.String())
 	}
 	snapshot, err := historyStore.Snapshot()
 	if err != nil {
 		t.Fatalf("snapshot history: %v", err)
 	}
 	if len(snapshot.Items) != 1 {
 		t.Fatalf("expected one history item, got %d", len(snapshot.Items))
 	}
 	item, err := historyStore.Get(snapshot.Items[0].ID)
 	if err != nil {
 		t.Fatalf("get history item: %v", err)
 	}
 	if item.Surface != "claude.messages" {
 		t.Fatalf("unexpected surface: %q", item.Surface)
 	}
 	if item.Model != "claude-sonnet-4-6" {
 		t.Fatalf("unexpected model: %q", item.Model)
 	}
 	if item.UserInput != "hello from claude" {
 		t.Fatalf("unexpected user input: %q", item.UserInput)
 	}
 	if item.Content != "ok" {
 		t.Fatalf("expected raw upstream content, got %q", item.Content)
 	}
 }
 func (claudeCurrentInputAuth) Release(*auth.RequestAuth) {}
 type claudeCurrentInputDS struct {
 	uploads []dsclient.UploadFileRequest
 	payload map[string]any
 }
 func (d *claudeCurrentInputDS) CreateSession(context.Context, *auth.RequestAuth, int) (string, error) {
 	return "session-id", nil
 }
 func (d *claudeCurrentInputDS) GetPow(context.Context, *auth.RequestAuth, int) (string, error) {
 	return "pow", nil
 }
 func (d *claudeCurrentInputDS) UploadFile(_ context.Context, _ *auth.RequestAuth, req dsclient.UploadFileRequest, _ int) (*dsclient.UploadFileResult, error) {
 	d.uploads = append(d.uploads, req)
 	return &dsclient.UploadFileResult{ID: "file-claude-history"}, nil
 }
 func (d *claudeCurrentInputDS) CallCompletion(_ context.Context, _ *auth.RequestAuth, payload map[string]any, _ string, _ int) (*http.Response, error) {
 	d.payload = payload
 	return &http.Response{
 		StatusCode: http.StatusOK,
 		Header:     make(http.Header),
 		Body:       io.NopCloser(strings.NewReader("data: {\"p\":\"response/content\",\"v\":\"ok\"}\n")),
 	}, nil
 }
 func TestClaudeDirectAppliesCurrentInputFile(t *testing.T) {
 	ds := &claudeCurrentInputDS{}
 	historyStore := chathistory.New(filepath.Join(t.TempDir(), "history.json"))
 	h := &Handler{
 		Store:       mockClaudeConfig{aliases: map[string]string{"claude-sonnet-4-6": "deepseek-v4-flash"}},
 		Auth:        claudeCurrentInputAuth{},
 		DS:          ds,
 		ChatHistory: historyStore,
 	}
 	reqBody := `{"model":"claude-sonnet-4-6","messages":[{"role":"user","content":"hello from claude"}],"max_tokens":1024}`
 	req := httptest.NewRequest(http.MethodPost, "/v1/messages", strings.NewReader(reqBody))
 	req.Header.Set("Content-Type", "application/json")
 	rec := httptest.NewRecorder()
 	h.Messages(rec, req)
 	if rec.Code != http.StatusOK {
 		t.Fatalf("expected 200, got %d body=%s", rec.Code, rec.Body.String())
 	}
 	if len(ds.uploads) != 1 {
 		t.Fatalf("expected one current input upload, got %d", len(ds.uploads))
 	}
 	if ds.uploads[0].Filename != "DS2API_HISTORY.txt" {
 		t.Fatalf("unexpected upload filename: %q", ds.uploads[0].Filename)
 	}
 	refIDs, _ := ds.payload["ref_file_ids"].([]any)
 	if len(refIDs) != 1 || refIDs[0] != "file-claude-history" {
 		t.Fatalf("expected uploaded history ref id, got %#v", ds.payload["ref_file_ids"])
 	}
 	prompt, _ := ds.payload["prompt"].(string)
 	if !strings.Contains(prompt, "Continue from the latest state in the attached DS2API_HISTORY.txt context.") {
 		t.Fatalf("expected continuation prompt, got %q", prompt)
 	}
 	snapshot, err := historyStore.Snapshot()
 	if err != nil {
 		t.Fatalf("snapshot history: %v", err)
 	}
 	if len(snapshot.Items) != 1 {
 		t.Fatalf("expected one history item, got %d", len(snapshot.Items))
 	}
 	full, err := historyStore.Get(snapshot.Items[0].ID)
 	if err != nil {
 		t.Fatalf("get history item: %v", err)
 	}
 	if full.HistoryText != string(ds.uploads[0].Data) {
 		t.Fatalf("expected uploaded current input file to be persisted in history text")
 	}
 	if len(full.Messages) != 1 || !strings.Contains(full.Messages[0].Content, "Continue from the latest state in the attached DS2API_HISTORY.txt context.") {
 		t.Fatalf("expected persisted message to match upstream continuation prompt, got %#v", full.Messages)
 	}
 }
--- a/internal/httpapi/claude/deps.go
+++ b/internal/httpapi/claude/deps.go
@@ -17,12 +17,14 @@ type AuthResolver interface {
 type DeepSeekCaller interface {
 	CreateSession(ctx context.Context, a *auth.RequestAuth, maxAttempts int) (string, error)
 	GetPow(ctx context.Context, a *auth.RequestAuth, maxAttempts int) (string, error)
+	UploadFile(ctx context.Context, a *auth.RequestAuth, req dsclient.UploadFileRequest, maxAttempts int) (*dsclient.UploadFileResult, error)
 	CallCompletion(ctx context.Context, a *auth.RequestAuth, payload map[string]any, powResp string, maxAttempts int) (*http.Response, error)
 }
 type ConfigReader interface {
 	ModelAliases() map[string]string
-	CompatStripReferenceMarkers() bool
+	CurrentInputFileEnabled() bool
+	CurrentInputFileMinChars() int
 }
 type OpenAIChatRunner interface {
--- a/internal/httpapi/claude/deps_injection_test.go
+++ b/internal/httpapi/claude/deps_injection_test.go
@@ -7,7 +7,8 @@ type mockClaudeConfig struct {
 }
 func (m mockClaudeConfig) ModelAliases() map[string]string { return m.aliases }
-func (mockClaudeConfig) CompatStripReferenceMarkers() bool { return true }
+func (mockClaudeConfig) CurrentInputFileEnabled() bool     { return true }
+func (mockClaudeConfig) CurrentInputFileMinChars() int     { return 0 }
 func TestNormalizeClaudeRequestUsesGlobalAliasMapping(t *testing.T) {
 	req := map[string]any{
@@ -27,11 +28,32 @@ func TestNormalizeClaudeRequestUsesGlobalAliasMapping(t *testing.T) {
 	if out.Standard.ResolvedModel != "deepseek-v4-pro-search" {
 		t.Fatalf("resolved model mismatch: got=%q", out.Standard.ResolvedModel)
 	}
-	if out.Standard.Thinking || !out.Standard.Search {
+	if !out.Standard.Thinking || !out.Standard.Search {
 		t.Fatalf("unexpected flags: thinking=%v search=%v", out.Standard.Thinking, out.Standard.Search)
 	}
 }
 func TestNormalizeClaudeRequestDisablesThinkingWhenRequested(t *testing.T) {
 	req := map[string]any{
 		"model": "claude-opus-4-6",
 		"messages": []any{
 			map[string]any{"role": "user", "content": "hello"},
 		},
 		"thinking": map[string]any{"type": "disabled"},
 	}
 	out, err := normalizeClaudeRequest(mockClaudeConfig{
 		aliases: map[string]string{
 			"claude-opus-4-6": "deepseek-v4-pro",
 		},
 	}, req)
 	if err != nil {
 		t.Fatalf("normalizeClaudeRequest error: %v", err)
 	}
 	if out.Standard.Thinking {
 		t.Fatalf("expected explicit Claude thinking disable to win")
 	}
 }
 func TestNormalizeClaudeRequestEnablesThinkingWhenRequested(t *testing.T) {
 	req := map[string]any{
 		"model": "claude-opus-4-6",
--- a/internal/httpapi/claude/handler_messages.go
+++ b/internal/httpapi/claude/handler_messages.go
@@ -2,13 +2,24 @@ package claude
 import (
 	"bytes"
 	"context"
 	"encoding/json"
 	"errors"
 	"fmt"
 	"io"
 	"net/http"
 	"net/http/httptest"
 	"strings"
 	"time"
 	"ds2api/internal/auth"
 	"ds2api/internal/completionruntime"
 	"ds2api/internal/config"
 	claudefmt "ds2api/internal/format/claude"
 	"ds2api/internal/httpapi/openai/history"
 	"ds2api/internal/httpapi/requestbody"
 	"ds2api/internal/promptcompat"
 	"ds2api/internal/responsehistory"
 	streamengine "ds2api/internal/stream"
 	"ds2api/internal/translatorcliproxy"
 	"ds2api/internal/util"
@@ -20,20 +31,131 @@ func (h *Handler) Messages(w http.ResponseWriter, r *http.Request) {
 	if strings.TrimSpace(r.Header.Get("anthropic-version")) == "" {
 		r.Header.Set("anthropic-version", "2023-06-01")
 	}
-	if h.OpenAI == nil {
-		writeClaudeError(w, http.StatusInternalServerError, "OpenAI proxy backend unavailable.")
+	if isClaudeVercelProxyRequest(r) && h.proxyViaOpenAI(w, r, h.Store) {
 		return
 	}
-	if h.proxyViaOpenAI(w, r, h.Store) {
+	if h.Auth == nil || h.DS == nil {
+		if h.OpenAI != nil && h.proxyViaOpenAI(w, r, h.Store) {
+			return
+		}
+		writeClaudeError(w, http.StatusInternalServerError, "Claude runtime backend unavailable.")
 		return
 	}
-	writeClaudeError(w, http.StatusBadGateway, "Failed to proxy Claude request.")
+	if h.handleClaudeDirect(w, r) {
+		return
+	}
+	writeClaudeError(w, http.StatusBadGateway, "Failed to handle Claude request.")
 }
 func isClaudeVercelProxyRequest(r *http.Request) bool {
 	if r == nil || r.URL == nil {
 		return false
 	}
 	return strings.TrimSpace(r.URL.Query().Get("__stream_prepare")) == "1" ||
 		strings.TrimSpace(r.URL.Query().Get("__stream_release")) == "1"
 }
 func (h *Handler) handleClaudeDirect(w http.ResponseWriter, r *http.Request) bool {
 	raw, err := io.ReadAll(r.Body)
 	if err != nil {
 		if errors.Is(err, requestbody.ErrInvalidUTF8Body) {
 			writeClaudeError(w, http.StatusBadRequest, "invalid json")
 		} else {
 			writeClaudeError(w, http.StatusBadRequest, "invalid body")
 		}
 		return true
 	}
 	var req map[string]any
 	if err := json.Unmarshal(raw, &req); err != nil {
 		writeClaudeError(w, http.StatusBadRequest, "invalid json")
 		return true
 	}
 	norm, err := normalizeClaudeRequest(h.Store, req)
 	if err != nil {
 		writeClaudeError(w, http.StatusBadRequest, err.Error())
 		return true
 	}
 	exposeThinking := norm.Standard.Thinking
 	a, err := h.Auth.Determine(r)
 	if err != nil {
 		writeClaudeError(w, http.StatusUnauthorized, err.Error())
 		return true
 	}
 	defer h.Auth.Release(a)
 	stdReq, err := h.applyCurrentInputFile(r.Context(), a, norm.Standard)
 	if err != nil {
 		status, message := mapCurrentInputFileError(err)
 		writeClaudeError(w, status, message)
 		return true
 	}
 	historySession := responsehistory.Start(responsehistory.StartParams{
 		Store:    h.ChatHistory,
 		Request:  r,
 		Auth:     a,
 		Surface:  "claude.messages",
 		Standard: stdReq,
 	})
 	if stdReq.Stream {
 		h.handleClaudeDirectStream(w, r, a, stdReq, historySession)
 		return true
 	}
 	result, outErr := completionruntime.ExecuteNonStreamWithRetry(r.Context(), h.DS, a, stdReq, completionruntime.Options{
 		RetryEnabled:     true,
 		CurrentInputFile: h.Store,
 	})
 	if outErr != nil {
 		if historySession != nil {
 			historySession.ErrorTurn(outErr.Status, outErr.Message, outErr.Code, result.Turn)
 		}
 		writeClaudeError(w, outErr.Status, outErr.Message)
 		return true
 	}
 	if historySession != nil {
 		historySession.SuccessTurn(http.StatusOK, result.Turn, responsehistory.GenericUsage(result.Turn))
 	}
 	writeJSON(w, http.StatusOK, claudefmt.BuildMessageResponseFromTurn(
 		fmt.Sprintf("msg_%d", time.Now().UnixNano()),
 		stdReq.ResponseModel,
 		result.Turn,
 		exposeThinking,
 	))
 	return true
 }
 func (h *Handler) applyCurrentInputFile(ctx context.Context, a *auth.RequestAuth, stdReq promptcompat.StandardRequest) (promptcompat.StandardRequest, error) {
 	if h == nil {
 		return stdReq, nil
 	}
 	return (history.Service{Store: h.Store, DS: h.DS}).ApplyCurrentInputFile(ctx, a, stdReq)
 }
 func mapCurrentInputFileError(err error) (int, string) {
 	return history.MapError(err)
 }
 func (h *Handler) handleClaudeDirectStream(w http.ResponseWriter, r *http.Request, a *auth.RequestAuth, stdReq promptcompat.StandardRequest, historySession *responsehistory.Session) {
 	start, outErr := completionruntime.StartCompletion(r.Context(), h.DS, a, stdReq, completionruntime.Options{
 		CurrentInputFile: h.Store,
 	})
 	if outErr != nil {
 		if historySession != nil {
 			historySession.Error(outErr.Status, outErr.Message, outErr.Code, "", "")
 		}
 		writeClaudeError(w, outErr.Status, outErr.Message)
 		return
 	}
 	streamReq := start.Request
 	h.handleClaudeStreamRealtime(w, r, start.Response, streamReq.ResponseModel, streamReq.Messages, streamReq.Thinking, streamReq.Search, streamReq.ToolNames, streamReq.ToolsRaw, historySession)
 }
 func (h *Handler) proxyViaOpenAI(w http.ResponseWriter, r *http.Request, store ConfigReader) bool {
 	raw, err := io.ReadAll(r.Body)
 	if err != nil {
-		writeClaudeError(w, http.StatusBadRequest, "invalid body")
+		if errors.Is(err, requestbody.ErrInvalidUTF8Body) {
 			writeClaudeError(w, http.StatusBadRequest, "invalid json")
 		} else {
 			writeClaudeError(w, http.StatusBadRequest, "invalid body")
 		}
 		return true
 	}
 	var req map[string]any
@@ -52,7 +174,7 @@ func (h *Handler) proxyViaOpenAI(w http.ResponseWriter, r *http.Request, store C
 		}
 	}
 	translatedReq := translatorcliproxy.ToOpenAI(sdktranslator.FormatClaude, translateModel, raw, stream)
-	translatedReq, exposeThinking := applyClaudeThinkingPolicyToOpenAIRequest(translatedReq, req, stream)
+	translatedReq, exposeThinking := applyClaudeThinkingPolicyToOpenAIRequest(translatedReq, req)
 	isVercelPrepare := strings.TrimSpace(r.URL.Query().Get("__stream_prepare")) == "1"
 	isVercelRelease := strings.TrimSpace(r.URL.Query().Get("__stream_release")) == "1"
@@ -127,7 +249,7 @@ func (h *Handler) proxyViaOpenAI(w http.ResponseWriter, r *http.Request, store C
 	return true
 }
-func applyClaudeThinkingPolicyToOpenAIRequest(translated []byte, original map[string]any, stream bool) ([]byte, bool) {
+func applyClaudeThinkingPolicyToOpenAIRequest(translated []byte, original map[string]any) ([]byte, bool) {
 	req := map[string]any{}
 	if err := json.Unmarshal(translated, &req); err != nil {
 		return translated, false
@@ -137,7 +259,7 @@ func applyClaudeThinkingPolicyToOpenAIRequest(translated []byte, original map[st
 		if _, translatedHasOverride := util.ResolveThinkingOverride(req); translatedHasOverride {
 			return translated, false
 		}
-		enabled = !stream
+		enabled = true
 	}
 	typ := "disabled"
 	if enabled {
@@ -146,9 +268,9 @@ func applyClaudeThinkingPolicyToOpenAIRequest(translated []byte, original map[st
 	req["thinking"] = map[string]any{"type": typ}
 	out, err := json.Marshal(req)
 	if err != nil {
-		return translated, ok && enabled
+		return translated, enabled
 	}
-	return out, ok && enabled
+	return out, enabled
 }
 func stripClaudeThinkingBlocks(raw []byte) []byte {
@@ -177,10 +299,17 @@ func stripClaudeThinkingBlocks(raw []byte) []byte {
 	return out
 }
-func (h *Handler) handleClaudeStreamRealtime(w http.ResponseWriter, r *http.Request, resp *http.Response, model string, messages []any, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any) {
+func (h *Handler) handleClaudeStreamRealtime(w http.ResponseWriter, r *http.Request, resp *http.Response, model string, messages []any, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, historySessions ...*responsehistory.Session) {
 	var historySession *responsehistory.Session
 	if len(historySessions) > 0 {
 		historySession = historySessions[0]
 	}
 	defer func() { _ = resp.Body.Close() }()
 	if resp.StatusCode != http.StatusOK {
 		body, _ := io.ReadAll(resp.Body)
 		if historySession != nil {
 			historySession.Error(resp.StatusCode, strings.TrimSpace(string(body)), "error", "", "")
 		}
 		writeClaudeError(w, http.StatusInternalServerError, string(body))
 		return
 	}
@@ -203,9 +332,11 @@ func (h *Handler) handleClaudeStreamRealtime(w http.ResponseWriter, r *http.Requ
 		messages,
 		thinkingEnabled,
 		searchEnabled,
-		h.compatStripReferenceMarkers(),
+		stripReferenceMarkersEnabled(),
 		toolNames,
 		toolsRaw,
 		buildClaudePromptTokenText(messages, thinkingEnabled),
 		historySession,
 	)
 	streamRuntime.sendMessageStart()
--- a/internal/httpapi/claude/handler_routes.go
+++ b/internal/httpapi/claude/handler_routes.go
@@ -6,8 +6,10 @@ import (
 	"github.com/go-chi/chi/v5"
 	"ds2api/internal/chathistory"
 	"ds2api/internal/config"
 	dsprotocol "ds2api/internal/deepseek/protocol"
 	"ds2api/internal/textclean"
 	"ds2api/internal/util"
 )
@@ -15,17 +17,15 @@ import (
 var writeJSON = util.WriteJSON
 type Handler struct {
-	Store  ConfigReader
+	Store       ConfigReader
-	Auth   AuthResolver
+	Auth        AuthResolver
-	DS     DeepSeekCaller
+	DS          DeepSeekCaller
-	OpenAI OpenAIChatRunner
+	OpenAI      OpenAIChatRunner
 	ChatHistory *chathistory.Store
 }
-func (h *Handler) compatStripReferenceMarkers() bool {
-	if h == nil || h.Store == nil {
-		return true
-	}
-	return h.Store.CompatStripReferenceMarkers()
+func stripReferenceMarkersEnabled() bool {
+	return textclean.StripReferenceMarkersEnabled()
 }
 var (
--- a/internal/httpapi/claude/handler_stream_test.go
+++ b/internal/httpapi/claude/handler_stream_test.go
@@ -28,6 +28,18 @@ func makeClaudeSSEHTTPResponse(lines ...string) *http.Response {
 	}
 }
 func makeClaudeContentLine(t *testing.T, text string) string {
 	t.Helper()
 	line, err := json.Marshal(map[string]any{
 		"p": "response/content",
 		"v": text,
 	})
 	if err != nil {
 		t.Fatalf("marshal content line failed: %v", err)
 	}
 	return "data: " + string(line)
 }
 func parseClaudeFrames(t *testing.T, body string) []claudeFrame {
 	t.Helper()
 	chunks := strings.Split(body, "\n\n")
@@ -71,6 +83,17 @@ func findClaudeFrames(frames []claudeFrame, event string) []claudeFrame {
 	return out
 }
 func collectClaudeTextDeltas(frames []claudeFrame) string {
 	var combined strings.Builder
 	for _, f := range findClaudeFrames(frames, "content_block_delta") {
 		delta, _ := f.Payload["delta"].(map[string]any)
 		if delta["type"] == "text_delta" {
 			combined.WriteString(asString(delta["text"]))
 		}
 	}
 	return combined.String()
 }
 func TestHandleClaudeStreamRealtimeTextIncrementsWithEventHeaders(t *testing.T) {
 	h := &Handler{}
 	resp := makeClaudeSSEHTTPResponse(
@@ -96,8 +119,8 @@ func TestHandleClaudeStreamRealtimeTextIncrementsWithEventHeaders(t *testing.T)
 	frames := parseClaudeFrames(t, body)
 	deltas := findClaudeFrames(frames, "content_block_delta")
-	if len(deltas) < 2 {
+	if len(deltas) < 1 {
-		t.Fatalf("expected at least 2 text deltas, got=%d body=%s", len(deltas), body)
+		t.Fatalf("expected at least 1 text delta, got=%d body=%s", len(deltas), body)
 	}
 	combined := strings.Builder{}
 	for _, f := range deltas {
@@ -111,6 +134,52 @@ func TestHandleClaudeStreamRealtimeTextIncrementsWithEventHeaders(t *testing.T)
 	}
 }
 func TestHandleClaudeStreamRealtimeToolBufferedPlainTextDoesNotRepeatFinalText(t *testing.T) {
 	h := &Handler{}
 	want := "明白\n\nBash\nIN\npwd\nOUT\nok"
 	resp := makeClaudeSSEHTTPResponse(
 		makeClaudeContentLine(t, "明"),
 		makeClaudeContentLine(t, "白\n\nBash\nIN\npwd\n"),
 		makeClaudeContentLine(t, "OUT\nok"),
 		`data: [DONE]`,
 	)
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/anthropic/v1/messages", nil)
 	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "use tool"}}, false, false, []string{"Bash"}, nil)
 	frames := parseClaudeFrames(t, rec.Body.String())
 	if got := collectClaudeTextDeltas(frames); got != want {
 		t.Fatalf("unexpected combined text: got %q want %q body=%s", got, want, rec.Body.String())
 	}
 }
 func TestHandleClaudeStreamRealtimeTrimsContinuationReplay(t *testing.T) {
 	h := &Handler{}
 	prefix := strings.Repeat("A", 40)
 	resp := makeClaudeSSEHTTPResponse(
 		`data: {"p":"response/content","v":"`+prefix+`"}`,
 		`data: {"p":"response/content","v":"`+prefix+` tail"}`,
 		`data: [DONE]`,
 	)
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/anthropic/v1/messages", nil)
 	h.handleClaudeStreamRealtime(rec, req, resp, "claude-sonnet-4-5", []any{map[string]any{"role": "user", "content": "hi"}}, false, false, nil, nil)
 	frames := parseClaudeFrames(t, rec.Body.String())
 	combined := strings.Builder{}
 	for _, f := range findClaudeFrames(frames, "content_block_delta") {
 		delta, _ := f.Payload["delta"].(map[string]any)
 		if delta["type"] == "text_delta" {
 			combined.WriteString(asString(delta["text"]))
 		}
 	}
 	if got, want := combined.String(), prefix+" tail"; got != want {
 		t.Fatalf("unexpected combined text: got %q want %q body=%s", got, want, rec.Body.String())
 	}
 }
 func TestHandleClaudeStreamRealtimeThinkingDelta(t *testing.T) {
 	h := &Handler{}
 	resp := makeClaudeSSEHTTPResponse(
--- a/internal/httpapi/claude/handler_tokens.go
+++ b/internal/httpapi/claude/handler_tokens.go
@@ -3,8 +3,6 @@ package claude
 import (
 	"encoding/json"
 	"net/http"
-	"ds2api/internal/util"
 )
 func (h *Handler) CountTokens(w http.ResponseWriter, r *http.Request) {
@@ -26,26 +24,11 @@ func (h *Handler) CountTokens(w http.ResponseWriter, r *http.Request) {
 		writeClaudeError(w, http.StatusBadRequest, "Request must include 'model' and 'messages'.")
 		return
 	}
-	inputTokens := 0
-	if sys, ok := req["system"].(string); ok {
-		inputTokens += util.EstimateTokens(sys)
-	}
-	for _, item := range messages {
-		msg, ok := item.(map[string]any)
-		if !ok {
-			continue
-		}
-		inputTokens += 2
-		inputTokens += util.EstimateTokens(extractMessageContent(msg["content"]))
-	}
-	if tools, ok := req["tools"].([]any); ok {
-		for _, t := range tools {
-			b, _ := json.Marshal(t)
-			inputTokens += util.EstimateTokens(string(b))
-		}
-	}
-	if inputTokens < 1 {
-		inputTokens = 1
+	normalized, err := normalizeClaudeRequest(h.Store, req)
+	if err != nil {
+		writeClaudeError(w, http.StatusBadRequest, err.Error())
+		return
 	}
+	inputTokens := countClaudeInputTokens(normalized.Standard)
 	writeJSON(w, http.StatusOK, map[string]any{"input_tokens": inputTokens})
 }
--- a/internal/httpapi/claude/prompt_token_text.go
+++ b/internal/httpapi/claude/prompt_token_text.go
@@ -0,0 +1,7 @@
 package claude
 import "ds2api/internal/prompt"
 func buildClaudePromptTokenText(messages []any, thinkingEnabled bool) string {
 	return prompt.MessagesPrepareWithThinking(toMessageMaps(messages), thinkingEnabled)
 }
--- a/internal/httpapi/claude/proxy_vercel_test.go
+++ b/internal/httpapi/claude/proxy_vercel_test.go
@@ -14,7 +14,8 @@ type claudeProxyStoreStub struct {
 func (s claudeProxyStoreStub) ModelAliases() map[string]string { return s.aliases }
-func (claudeProxyStoreStub) CompatStripReferenceMarkers() bool { return true }
+func (claudeProxyStoreStub) CurrentInputFileEnabled() bool { return true }
 func (claudeProxyStoreStub) CurrentInputFileMinChars() int { return 0 }
 type openAIProxyStub struct {
 	status int
@@ -166,7 +167,7 @@ func TestClaudeProxyViaOpenAIEnablesThinkingWhenRequested(t *testing.T) {
 	}
 }
-func TestClaudeProxyViaOpenAIKeepsStreamDefaultThinkingDisabled(t *testing.T) {
+func TestClaudeProxyViaOpenAIEnablesStreamThinkingByDefault(t *testing.T) {
 	openAI := &openAIProxyCaptureStub{}
 	h := &Handler{
 		Store:  claudeProxyStoreStub{aliases: map[string]string{"claude-sonnet-4-6": "deepseek-v4-flash"}},
@@ -178,12 +179,12 @@ func TestClaudeProxyViaOpenAIKeepsStreamDefaultThinkingDisabled(t *testing.T) {
 	h.Messages(rec, req)
 	thinking, _ := openAI.seenReq["thinking"].(map[string]any)
-	if thinking["type"] != "disabled" {
+	if thinking["type"] != "enabled" {
-		t.Fatalf("expected Claude stream default to keep downstream thinking disabled, got %#v", openAI.seenReq)
+		t.Fatalf("expected Claude stream default to enable downstream thinking, got %#v", openAI.seenReq)
 	}
 }
-func TestClaudeProxyViaOpenAIStripsThinkingBlocksFromNonStreamResponse(t *testing.T) {
+func TestClaudeProxyViaOpenAIExposesThinkingBlocksByDefault(t *testing.T) {
 	body := `{"id":"chatcmpl_1","object":"chat.completion","created":1,"model":"claude-sonnet-4-5","choices":[{"index":0,"message":{"role":"assistant","content":null,"reasoning_content":"internal reasoning","tool_calls":[{"id":"call_1","type":"function","function":{"name":"search","arguments":"{\"q\":\"x\"}"}}]},"finish_reason":"tool_calls"}],"usage":{"prompt_tokens":1,"completion_tokens":1,"total_tokens":2}}`
 	h := &Handler{OpenAI: openAIProxyStub{status: 200, body: body}}
 	req := httptest.NewRequest(http.MethodPost, "/anthropic/v1/messages", strings.NewReader(`{"model":"claude-sonnet-4-5","messages":[{"role":"user","content":"hi"}],"stream":false}`))
@@ -195,14 +196,31 @@ func TestClaudeProxyViaOpenAIStripsThinkingBlocksFromNonStreamResponse(t *testin
 		t.Fatalf("unexpected status: %d body=%s", rec.Code, rec.Body.String())
 	}
 	got := rec.Body.String()
-	if strings.Contains(got, `"type":"thinking"`) {
+	if !strings.Contains(got, `"type":"thinking"`) {
-		t.Fatalf("expected converted Claude response to strip thinking block, got %s", got)
+		t.Fatalf("expected converted Claude response to expose thinking block, got %s", got)
 	}
 	if !strings.Contains(got, `"tool_use"`) {
 		t.Fatalf("expected converted Claude response to preserve tool_use, got %s", got)
 	}
 }
 func TestClaudeProxyViaOpenAIStripsThinkingBlocksWhenDisabled(t *testing.T) {
 	body := `{"id":"chatcmpl_1","object":"chat.completion","created":1,"model":"claude-sonnet-4-5","choices":[{"index":0,"message":{"role":"assistant","content":"ok","reasoning_content":"internal reasoning"},"finish_reason":"stop"}],"usage":{"prompt_tokens":1,"completion_tokens":1,"total_tokens":2}}`
 	h := &Handler{OpenAI: openAIProxyStub{status: 200, body: body}}
 	req := httptest.NewRequest(http.MethodPost, "/anthropic/v1/messages", strings.NewReader(`{"model":"claude-sonnet-4-5","messages":[{"role":"user","content":"hi"}],"thinking":{"type":"disabled"},"stream":false}`))
 	rec := httptest.NewRecorder()
 	h.Messages(rec, req)
 	if rec.Code != http.StatusOK {
 		t.Fatalf("unexpected status: %d body=%s", rec.Code, rec.Body.String())
 	}
 	got := rec.Body.String()
 	if strings.Contains(got, `"type":"thinking"`) {
 		t.Fatalf("expected disabled thinking to strip thinking block, got %s", got)
 	}
 }
 func TestClaudeProxyTranslatesInlineImageToOpenAIDataURL(t *testing.T) {
 	openAI := &openAIProxyCaptureStub{}
 	h := &Handler{OpenAI: openAI}
--- a/internal/httpapi/claude/standard_request.go
+++ b/internal/httpapi/claude/standard_request.go
@@ -32,11 +32,11 @@ func normalizeClaudeRequest(store ConfigReader, req map[string]any) (claudeNorma
 	dsPayload := convertClaudeToDeepSeek(payload, store)
 	dsModel, _ := dsPayload["model"].(string)
-	_, searchEnabled, ok := config.GetModelConfig(dsModel)
+	defaultThinkingEnabled, searchEnabled, ok := config.GetModelConfig(dsModel)
 	if !ok {
 		searchEnabled = false
 	}
-	thinkingEnabled := util.ResolveThinkingEnabled(req, false)
+	thinkingEnabled := util.ResolveThinkingEnabled(req, defaultThinkingEnabled)
 	if config.IsNoThinkingModel(dsModel) {
 		thinkingEnabled = false
 	}
@@ -48,17 +48,18 @@ func normalizeClaudeRequest(store ConfigReader, req map[string]any) (claudeNorma
 	return claudeNormalizedRequest{
 		Standard: promptcompat.StandardRequest{
-			Surface:        "anthropic_messages",
+			Surface:         "anthropic_messages",
-			RequestedModel: strings.TrimSpace(model),
+			RequestedModel:  strings.TrimSpace(model),
-			ResolvedModel:  dsModel,
+			ResolvedModel:   dsModel,
-			ResponseModel:  strings.TrimSpace(model),
+			ResponseModel:   strings.TrimSpace(model),
-			Messages:       payload["messages"].([]any),
+			Messages:        payload["messages"].([]any),
-			ToolsRaw:       toolsRequested,
+			PromptTokenText: finalPrompt,
-			FinalPrompt:    finalPrompt,
+			ToolsRaw:        toolsRequested,
-			ToolNames:      toolNames,
+			FinalPrompt:     finalPrompt,
-			Stream:         util.ToBool(req["stream"]),
+			ToolNames:       toolNames,
-			Thinking:       thinkingEnabled,
+			Stream:          util.ToBool(req["stream"]),
-			Search:         searchEnabled,
+			Thinking:        thinkingEnabled,
 			Search:          searchEnabled,
 		},
 		NormalizedMessages: normalizedMessages,
 	}, nil
--- a/internal/httpapi/claude/stream_runtime_core.go
+++ b/internal/httpapi/claude/stream_runtime_core.go
@@ -6,8 +6,11 @@ import (
 	"strings"
 	"time"
 	"ds2api/internal/responsehistory"
 	"ds2api/internal/sse"
 	streamengine "ds2api/internal/stream"
 	"ds2api/internal/toolcall"
 	"ds2api/internal/toolstream"
 )
 type claudeStreamRuntime struct {
@@ -15,10 +18,11 @@ type claudeStreamRuntime struct {
 	rc       *http.ResponseController
 	canFlush bool
-	model     string
+	model           string
-	toolNames []string
+	toolNames       []string
-	messages  []any
+	messages        []any
-	toolsRaw  any
+	toolsRaw        any
 	promptTokenText string
 	thinkingEnabled       bool
 	searchEnabled         bool
@@ -29,13 +33,21 @@ type claudeStreamRuntime struct {
 	thinking  strings.Builder
 	text      strings.Builder
 	sieve                 toolstream.State
 	rawText               strings.Builder
 	rawThinking           strings.Builder
 	toolDetectionThinking strings.Builder
 	toolCallsDetected     bool
 	nextBlockIndex     int
 	thinkingBlockOpen  bool
 	thinkingBlockIndex int
 	textBlockOpen      bool
 	textBlockIndex     int
 	textEmitted        bool
 	ended              bool
 	upstreamErr        string
 	history            *responsehistory.Session
 }
 func newClaudeStreamRuntime(
@@ -49,6 +61,8 @@ func newClaudeStreamRuntime(
 	stripReferenceMarkers bool,
 	toolNames []string,
 	toolsRaw any,
 	promptTokenText string,
 	history *responsehistory.Session,
 ) *claudeStreamRuntime {
 	return &claudeStreamRuntime{
 		w:                     w,
@@ -62,6 +76,8 @@ func newClaudeStreamRuntime(
 		stripReferenceMarkers: stripReferenceMarkers,
 		toolNames:             toolNames,
 		toolsRaw:              toolsRaw,
 		promptTokenText:       promptTokenText,
 		history:               history,
 		messageID:             fmt.Sprintf("msg_%d", time.Now().UnixNano()),
 		thinkingBlockIndex:    -1,
 		textBlockIndex:        -1,
@@ -81,8 +97,28 @@ func (s *claudeStreamRuntime) onParsed(parsed sse.LineResult) streamengine.Parse
 	}
 	contentSeen := false
 	for _, p := range parsed.ToolDetectionThinkingParts {
 		trimmed := sse.TrimContinuationOverlapFromBuilder(&s.toolDetectionThinking, p.Text)
 		if trimmed != "" {
 			s.toolDetectionThinking.WriteString(trimmed)
 		}
 	}
 	for _, p := range parsed.Parts {
-		cleanedText := cleanVisibleOutput(p.Text, s.stripReferenceMarkers)
+		var rawTrimmed string
 		if p.Type == "thinking" {
 			rawTrimmed = sse.TrimContinuationOverlapFromBuilder(&s.rawThinking, p.Text)
 		} else {
 			rawTrimmed = sse.TrimContinuationOverlapFromBuilder(&s.rawText, p.Text)
 		}
 		if rawTrimmed == "" {
 			continue
 		}
 		if p.Type == "thinking" {
 			s.rawThinking.WriteString(rawTrimmed)
 		} else {
 			s.rawText.WriteString(rawTrimmed)
 		}
 		cleanedText := cleanVisibleOutput(rawTrimmed, s.stripReferenceMarkers)
 		if cleanedText == "" {
 			continue
 		}
@@ -95,7 +131,7 @@ func (s *claudeStreamRuntime) onParsed(parsed sse.LineResult) streamengine.Parse
 			if !s.thinkingEnabled {
 				continue
 			}
-			trimmed := sse.TrimContinuationOverlap(s.thinking.String(), cleanedText)
+			trimmed := sse.TrimContinuationOverlapFromBuilder(&s.thinking, cleanedText)
 			if trimmed == "" {
 				continue
 			}
@@ -125,44 +161,86 @@ func (s *claudeStreamRuntime) onParsed(parsed sse.LineResult) streamengine.Parse
 			continue
 		}
-		trimmed := sse.TrimContinuationOverlap(s.text.String(), cleanedText)
-		if trimmed == "" {
-			continue
-		}
-		s.text.WriteString(trimmed)
-		if s.bufferToolContent {
-			if hasUnclosedCodeFence(s.text.String()) {
-				continue
-			}
-			continue
-		}
-		s.closeThinkingBlock()
-		if !s.textBlockOpen {
-			s.textBlockIndex = s.nextBlockIndex
-			s.nextBlockIndex++
-			s.send("content_block_start", map[string]any{
-				"type":  "content_block_start",
-				"index": s.textBlockIndex,
-				"content_block": map[string]any{
-					"type": "text",
-					"text": "",
-				},
-			})
-			s.textBlockOpen = true
-		}
-		s.send("content_block_delta", map[string]any{
-			"type":  "content_block_delta",
-			"index": s.textBlockIndex,
-			"delta": map[string]any{
-				"type": "text_delta",
-				"text": trimmed,
-			},
-		})
+		s.text.WriteString(cleanedText)
+
+		if !s.bufferToolContent {
+			s.closeThinkingBlock()
+			if !s.textBlockOpen {
+				s.textBlockIndex = s.nextBlockIndex
+				s.nextBlockIndex++
+				s.send("content_block_start", map[string]any{
+					"type":  "content_block_start",
+					"index": s.textBlockIndex,
+					"content_block": map[string]any{
+						"type": "text",
+						"text": "",
+					},
+				})
+				s.textBlockOpen = true
+			}
+			s.send("content_block_delta", map[string]any{
+				"type":  "content_block_delta",
+				"index": s.textBlockIndex,
+				"delta": map[string]any{
+					"type": "text_delta",
+					"text": cleanedText,
+				},
+			})
+			s.textEmitted = true
+			continue
+		}
+		events := toolstream.ProcessChunk(&s.sieve, rawTrimmed, s.toolNames)
+		for _, evt := range events {
+			if len(evt.ToolCalls) > 0 {
+				s.closeTextBlock()
+				s.toolCallsDetected = true
+				normalized := toolcall.NormalizeParsedToolCallsForSchemas(evt.ToolCalls, s.toolsRaw)
+				for _, tc := range normalized {
+					idx := s.nextBlockIndex
+					s.nextBlockIndex++
+					s.sendToolUseBlock(idx, tc)
+				}
+				continue
+			}
+			if evt.Content == "" {
+				continue
+			}
+			cleaned := cleanVisibleOutput(evt.Content, s.stripReferenceMarkers)
+			if cleaned == "" || (s.searchEnabled && sse.IsCitation(cleaned)) {
+				continue
+			}
+			s.closeThinkingBlock()
+			if !s.textBlockOpen {
+				s.textBlockIndex = s.nextBlockIndex
+				s.nextBlockIndex++
+				s.send("content_block_start", map[string]any{
+					"type":  "content_block_start",
+					"index": s.textBlockIndex,
+					"content_block": map[string]any{
+						"type": "text",
+						"text": "",
+					},
+				})
+				s.textBlockOpen = true
+			}
+			s.send("content_block_delta", map[string]any{
+				"type":  "content_block_delta",
+				"index": s.textBlockIndex,
+				"delta": map[string]any{
+					"type": "text_delta",
+					"text": cleaned,
+				},
+			})
+			s.textEmitted = true
+		}
 	}
 	if s.history != nil {
 		s.history.Progress(
 			responsehistory.ThinkingForArchive(s.rawThinking.String(), s.toolDetectionThinking.String(), s.thinking.String()),
 			responsehistory.TextForArchive(s.rawText.String(), s.text.String()),
 		)
 	}
 	return streamengine.ParsedDecision{ContentSeen: contentSeen}
 }
 func hasUnclosedCodeFence(text string) bool {
 	return strings.Count(text, "```")%2 == 1
 }
--- a/internal/httpapi/claude/stream_runtime_emit.go
+++ b/internal/httpapi/claude/stream_runtime_emit.go
@@ -42,7 +42,10 @@ func (s *claudeStreamRuntime) sendPing() {
 }
 func (s *claudeStreamRuntime) sendMessageStart() {
-	inputTokens := util.EstimateTokens(fmt.Sprintf("%v", s.messages))
+	inputTokens := countClaudeInputTokensFromText(s.promptTokenText, s.model)
 	if inputTokens == 0 {
 		inputTokens = util.CountPromptTokens(fmt.Sprintf("%v", s.messages), s.model)
 	}
 	s.send("message_start", map[string]any{
 		"type": "message_start",
 		"message": map[string]any{
--- a/internal/httpapi/claude/stream_runtime_finalize.go
+++ b/internal/httpapi/claude/stream_runtime_finalize.go
@@ -1,13 +1,16 @@
 package claude
 import (
 	"ds2api/internal/assistantturn"
 	"ds2api/internal/responsehistory"
 	"ds2api/internal/sse"
 	"ds2api/internal/toolcall"
 	"ds2api/internal/toolstream"
 	"encoding/json"
 	"fmt"
 	"time"
 	streamengine "ds2api/internal/stream"
 	"ds2api/internal/util"
 )
 func (s *claudeStreamRuntime) closeThinkingBlock() {
@@ -34,6 +37,32 @@ func (s *claudeStreamRuntime) closeTextBlock() {
 	s.textBlockIndex = -1
 }
 func (s *claudeStreamRuntime) sendToolUseBlock(idx int, tc toolcall.ParsedToolCall) {
 	s.send("content_block_start", map[string]any{
 		"type":  "content_block_start",
 		"index": idx,
 		"content_block": map[string]any{
 			"type":  "tool_use",
 			"id":    fmt.Sprintf("toolu_%d_%d", time.Now().Unix(), idx),
 			"name":  tc.Name,
 			"input": map[string]any{},
 		},
 	})
 	inputBytes, _ := json.Marshal(tc.Input)
 	s.send("content_block_delta", map[string]any{
 		"type":  "content_block_delta",
 		"index": idx,
 		"delta": map[string]any{
 			"type":         "input_json_delta",
 			"partial_json": string(inputBytes),
 		},
 	})
 	s.send("content_block_stop", map[string]any{
 		"type":  "content_block_stop",
 		"index": idx,
 	})
 }
 func (s *claudeStreamRuntime) finalize(stopReason string) {
 	if s.ended {
 		return
@@ -41,49 +70,83 @@ func (s *claudeStreamRuntime) finalize(stopReason string) {
 	s.ended = true
 	s.closeThinkingBlock()
 	s.closeTextBlock()
-	finalThinking := s.thinking.String()
-	finalText := cleanVisibleOutput(s.text.String(), s.stripReferenceMarkers)
 	if s.bufferToolContent {
-		detected := toolcall.ParseStandaloneToolCalls(finalText, s.toolNames)
-		if len(detected) == 0 && finalText == "" && finalThinking != "" {
-			detected = toolcall.ParseStandaloneToolCalls(finalThinking, s.toolNames)
-		}
-		if len(detected) > 0 {
-			detected = toolcall.NormalizeParsedToolCallsForSchemas(detected, s.toolsRaw)
-			stopReason = "tool_use"
-			for i, tc := range detected {
-				idx := s.nextBlockIndex + i
-				s.send("content_block_start", map[string]any{
-					"type":  "content_block_start",
-					"index": idx,
-					"content_block": map[string]any{
-						"type":  "tool_use",
-						"id":    fmt.Sprintf("toolu_%d_%d", time.Now().Unix(), idx),
-						"name":  tc.Name,
-						"input": map[string]any{},
-					},
-				})
-
-				inputBytes, _ := json.Marshal(tc.Input)
-				s.send("content_block_delta", map[string]any{
-					"type":  "content_block_delta",
-					"index": idx,
-					"delta": map[string]any{
-						"type":         "input_json_delta",
-						"partial_json": string(inputBytes),
-					},
-				})
-
-				s.send("content_block_stop", map[string]any{
-					"type":  "content_block_stop",
-					"index": idx,
-				})
-			}
-			s.nextBlockIndex += len(detected)
-		} else if finalText != "" {
+		for _, evt := range toolstream.Flush(&s.sieve, s.toolNames) {
+			if len(evt.ToolCalls) > 0 {
+				s.closeTextBlock()
+				s.toolCallsDetected = true
+				normalized := toolcall.NormalizeParsedToolCallsForSchemas(evt.ToolCalls, s.toolsRaw)
+				for _, tc := range normalized {
+					idx := s.nextBlockIndex
+					s.nextBlockIndex++
+					s.sendToolUseBlock(idx, tc)
+				}
+				continue
+			}
+			if evt.Content != "" {
+				cleaned := cleanVisibleOutput(evt.Content, s.stripReferenceMarkers)
+				if cleaned == "" || (s.searchEnabled && sse.IsCitation(cleaned)) {
+					continue
+				}
+				if !s.textBlockOpen {
+					s.textBlockIndex = s.nextBlockIndex
+					s.nextBlockIndex++
+					s.send("content_block_start", map[string]any{
+						"type":  "content_block_start",
+						"index": s.textBlockIndex,
+						"content_block": map[string]any{
+							"type": "text",
+							"text": "",
+						},
+					})
+					s.textBlockOpen = true
+				}
+				s.send("content_block_delta", map[string]any{
+					"type":  "content_block_delta",
+					"index": s.textBlockIndex,
+					"delta": map[string]any{
+						"type": "text_delta",
+						"text": cleaned,
+					},
+				})
+				s.textEmitted = true
+			}
+		}
+	}
+	s.closeTextBlock()
+	turn := assistantturn.BuildTurnFromStreamSnapshot(assistantturn.StreamSnapshot{
+		RawText:               s.rawText.String(),
+		VisibleText:           s.text.String(),
+		RawThinking:           s.rawThinking.String(),
+		VisibleThinking:       s.thinking.String(),
+		DetectionThinking:     s.toolDetectionThinking.String(),
+		AlreadyEmittedCalls:   s.toolCallsDetected,
+		AlreadyEmittedToolRaw: s.toolCallsDetected,
+	}, assistantturn.BuildOptions{
+		Model:                 s.model,
+		Prompt:                s.promptTokenText,
+		SearchEnabled:         s.searchEnabled,
+		StripReferenceMarkers: s.stripReferenceMarkers,
+		ToolNames:             s.toolNames,
+		ToolsRaw:              s.toolsRaw,
+	})
+	finalText := turn.Text
+	outcome := assistantturn.FinalizeTurn(turn, assistantturn.FinalizeOptions{
+		AlreadyEmittedToolCalls: s.toolCallsDetected,
+	})
+	if s.bufferToolContent && !s.toolCallsDetected {
+		if len(turn.ToolCalls) > 0 {
+			stopReason = "tool_use"
+			for _, tc := range turn.ToolCalls {
+				idx := s.nextBlockIndex
+				s.nextBlockIndex++
+				s.sendToolUseBlock(idx, tc)
+			}
+		} else if finalText != "" && !s.textEmitted {
 			idx := s.nextBlockIndex
 			s.nextBlockIndex++
 			s.send("content_block_start", map[string]any{
@@ -102,6 +165,7 @@ func (s *claudeStreamRuntime) finalize(stopReason string) {
 					"text": finalText,
 				},
 			})
 			s.textEmitted = true
 			s.send("content_block_stop", map[string]any{
 				"type":  "content_block_stop",
 				"index": idx,
@@ -109,7 +173,19 @@ func (s *claudeStreamRuntime) finalize(stopReason string) {
 		}
 	}
-	outputTokens := util.EstimateTokens(finalThinking) + util.EstimateTokens(finalText)
+	if outcome.HasToolCalls {
+		stopReason = "tool_use"
+	}
+	if s.history != nil {
+		s.history.Success(
+			200,
+			responsehistory.ThinkingForArchive(turn.RawThinking, turn.DetectionThinking, turn.Thinking),
+			responsehistory.TextForArchive(turn.RawText, turn.Text),
+			stopReason,
+			responsehistory.GenericUsage(turn),
+		)
+	}
 	s.send("message_delta", map[string]any{
 		"type": "message_delta",
 		"delta": map[string]any{
@@ -117,7 +193,7 @@ func (s *claudeStreamRuntime) finalize(stopReason string) {
 			"stop_sequence": nil,
 		},
 		"usage": map[string]any{
-			"output_tokens": outputTokens,
+			"output_tokens": outcome.Usage.OutputTokens,
 		},
 	})
 	s.send("message_stop", map[string]any{"type": "message_stop"})
@@ -125,10 +201,16 @@ func (s *claudeStreamRuntime) finalize(stopReason string) {
 func (s *claudeStreamRuntime) onFinalize(reason streamengine.StopReason, scannerErr error) {
 	if string(reason) == "upstream_error" {
 		if s.history != nil {
 			s.history.Error(500, s.upstreamErr, "upstream_error", responsehistory.ThinkingForArchive(s.rawThinking.String(), s.toolDetectionThinking.String(), s.thinking.String()), responsehistory.TextForArchive(s.rawText.String(), s.text.String()))
 		}
 		s.sendError(s.upstreamErr)
 		return
 	}
 	if scannerErr != nil {
 		if s.history != nil {
 			s.history.Error(500, scannerErr.Error(), "error", responsehistory.ThinkingForArchive(s.rawThinking.String(), s.toolDetectionThinking.String(), s.thinking.String()), responsehistory.TextForArchive(s.rawText.String(), s.text.String()))
 		}
 		s.sendError(scannerErr.Error())
 		return
 	}
--- a/internal/httpapi/claude/stream_status_test.go
+++ b/internal/httpapi/claude/stream_status_test.go
@@ -23,7 +23,8 @@ type streamStatusClaudeStoreStub struct{}
 func (streamStatusClaudeStoreStub) ModelAliases() map[string]string { return nil }
-func (streamStatusClaudeStoreStub) CompatStripReferenceMarkers() bool { return true }
+func (streamStatusClaudeStoreStub) CurrentInputFileEnabled() bool { return true }
+func (streamStatusClaudeStoreStub) CurrentInputFileMinChars() int { return 0 }
 func captureClaudeStatusMiddleware(statuses *[]int) func(http.Handler) http.Handler {
 	return func(next http.Handler) http.Handler {
--- a/internal/httpapi/claude/token_count.go
+++ b/internal/httpapi/claude/token_count.go
@@ -0,0 +1,20 @@
+package claude
+import (
+	"strings"
+	"ds2api/internal/promptcompat"
+	"ds2api/internal/util"
+)
+func countClaudeInputTokens(stdReq promptcompat.StandardRequest) int {
+	promptText := stdReq.PromptTokenText
+	if strings.TrimSpace(promptText) == "" {
+		promptText = stdReq.FinalPrompt
+	}
+	return countClaudeInputTokensFromText(promptText, stdReq.ResolvedModel)
+}
+func countClaudeInputTokensFromText(promptText, model string) int {
+	return util.CountPromptTokens(promptText, model)
+}
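Reviewer sketch (not part of the diff): the new `countClaudeInputTokens` helper prefers `PromptTokenText` and falls back to `FinalPrompt` when the former is blank. The standalone example below illustrates only that selection rule. `standardRequest` and `promptTextForCounting` are hypothetical stand-ins, not names from the repository, and the real token counting done by `util.CountPromptTokens` is deliberately left out.

```go
package main

import (
	"fmt"
	"strings"
)

// standardRequest is a hypothetical stand-in for promptcompat.StandardRequest,
// reduced to the two fields the fallback looks at.
type standardRequest struct {
	PromptTokenText string
	FinalPrompt     string
}

// promptTextForCounting returns PromptTokenText unless it is empty or
// whitespace-only, in which case it falls back to FinalPrompt.
func promptTextForCounting(req standardRequest) string {
	if strings.TrimSpace(req.PromptTokenText) == "" {
		return req.FinalPrompt
	}
	return req.PromptTokenText
}

func main() {
	// Whitespace-only token text triggers the fallback.
	fmt.Println(promptTextForCounting(standardRequest{PromptTokenText: "  ", FinalPrompt: "fallback"}))
	// Non-blank token text wins over FinalPrompt.
	fmt.Println(promptTextForCounting(standardRequest{PromptTokenText: "primary", FinalPrompt: "fallback"}))
}
```

Running it prints `fallback` and then `primary`.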
--- a/internal/httpapi/gemini/convert_request.go
+++ b/internal/httpapi/gemini/convert_request.go
@@ -33,19 +33,24 @@ func normalizeGeminiRequest(store ConfigReader, routeModel string, req map[strin
 	toolsRaw := convertGeminiTools(req["tools"])
 	finalPrompt, toolNames := promptcompat.BuildOpenAIPromptForAdapter(messagesRaw, toolsRaw, "", thinkingEnabled)
 	if len(toolNames) == 0 && len(toolsRaw) > 0 {
 		toolNames = []string{"__any_tool__"}
 	}
 	passThrough := collectGeminiPassThrough(req)
 	return promptcompat.StandardRequest{
-		Surface:        "google_gemini",
-		RequestedModel: requestedModel,
-		ResolvedModel:  resolvedModel,
-		ResponseModel:  requestedModel,
-		Messages:       messagesRaw,
-		FinalPrompt:    finalPrompt,
-		ToolNames:      toolNames,
-		Stream:         stream,
-		Thinking:       thinkingEnabled,
-		Search:         searchEnabled,
-		PassThrough:    passThrough,
+		Surface:         "google_gemini",
+		RequestedModel:  requestedModel,
+		ResolvedModel:   resolvedModel,
+		ResponseModel:   requestedModel,
+		Messages:        messagesRaw,
+		PromptTokenText: finalPrompt,
+		ToolsRaw:        toolsRaw,
+		FinalPrompt:     finalPrompt,
+		ToolNames:       toolNames,
+		Stream:          stream,
+		Thinking:        thinkingEnabled,
+		Search:          searchEnabled,
+		PassThrough:     passThrough,
 	}, nil
 }
--- a/internal/httpapi/gemini/deps.go
+++ b/internal/httpapi/gemini/deps.go
@@ -17,12 +17,14 @@ type AuthResolver interface {
 type DeepSeekCaller interface {
 	CreateSession(ctx context.Context, a *auth.RequestAuth, maxAttempts int) (string, error)
 	GetPow(ctx context.Context, a *auth.RequestAuth, maxAttempts int) (string, error)
 	UploadFile(ctx context.Context, a *auth.RequestAuth, req dsclient.UploadFileRequest, maxAttempts int) (*dsclient.UploadFileResult, error)
 	CallCompletion(ctx context.Context, a *auth.RequestAuth, payload map[string]any, powResp string, maxAttempts int) (*http.Response, error)
 }
 type ConfigReader interface {
 	ModelAliases() map[string]string
-	CompatStripReferenceMarkers() bool
+	CurrentInputFileEnabled() bool
+	CurrentInputFileMinChars() int
 }
 type OpenAIChatRunner interface {
--- a/internal/httpapi/gemini/handler_generate.go
+++ b/internal/httpapi/gemini/handler_generate.go
@@ -2,8 +2,9 @@ package gemini
 import (
 	"bytes"
-	"ds2api/internal/toolcall"
+	"context"
 	"encoding/json"
 	"errors"
 	"io"
 	"net/http"
 	"net/http/httptest"
@@ -11,7 +12,15 @@ import (
 	"github.com/go-chi/chi/v5"
 	"ds2api/internal/assistantturn"
 	"ds2api/internal/auth"
 	"ds2api/internal/completionruntime"
 	"ds2api/internal/httpapi/openai/history"
 	"ds2api/internal/httpapi/requestbody"
 	"ds2api/internal/promptcompat"
 	"ds2api/internal/responsehistory"
 	"ds2api/internal/sse"
 	"ds2api/internal/toolcall"
 	"ds2api/internal/translatorcliproxy"
 	"ds2api/internal/util"
@@ -19,20 +28,126 @@ import (
 )
 func (h *Handler) handleGenerateContent(w http.ResponseWriter, r *http.Request, stream bool) {
-	if h.OpenAI == nil {
-		writeGeminiError(w, http.StatusInternalServerError, "OpenAI proxy backend unavailable.")
+	if isGeminiVercelProxyRequest(r) && h.proxyViaOpenAI(w, r, stream) {
 		return
 	}
-	if h.proxyViaOpenAI(w, r, stream) {
+	if h.Auth == nil || h.DS == nil {
+		if h.OpenAI != nil && h.proxyViaOpenAI(w, r, stream) {
+			return
+		}
+		writeGeminiError(w, http.StatusInternalServerError, "Gemini runtime backend unavailable.")
 		return
 	}
-	writeGeminiError(w, http.StatusBadGateway, "Failed to proxy Gemini request.")
+	if h.handleGeminiDirect(w, r, stream) {
+		return
+	}
+	writeGeminiError(w, http.StatusBadGateway, "Failed to handle Gemini request.")
 }
 func isGeminiVercelProxyRequest(r *http.Request) bool {
 	if r == nil || r.URL == nil {
 		return false
 	}
 	return strings.TrimSpace(r.URL.Query().Get("__stream_prepare")) == "1" ||
 		strings.TrimSpace(r.URL.Query().Get("__stream_release")) == "1"
 }
 func (h *Handler) handleGeminiDirect(w http.ResponseWriter, r *http.Request, stream bool) bool {
 	raw, err := io.ReadAll(r.Body)
 	if err != nil {
 		if errors.Is(err, requestbody.ErrInvalidUTF8Body) {
 			writeGeminiError(w, http.StatusBadRequest, "invalid json")
 		} else {
 			writeGeminiError(w, http.StatusBadRequest, "invalid body")
 		}
 		return true
 	}
 	routeModel := strings.TrimSpace(chi.URLParam(r, "model"))
 	var req map[string]any
 	if err := json.Unmarshal(raw, &req); err != nil {
 		writeGeminiError(w, http.StatusBadRequest, "invalid json")
 		return true
 	}
 	stdReq, err := normalizeGeminiRequest(h.Store, routeModel, req, stream)
 	if err != nil {
 		writeGeminiError(w, http.StatusBadRequest, err.Error())
 		return true
 	}
 	a, err := h.Auth.Determine(r)
 	if err != nil {
 		writeGeminiError(w, http.StatusUnauthorized, err.Error())
 		return true
 	}
 	defer h.Auth.Release(a)
 	stdReq, err = h.applyCurrentInputFile(r.Context(), a, stdReq)
 	if err != nil {
 		status, message := mapCurrentInputFileError(err)
 		writeGeminiError(w, status, message)
 		return true
 	}
 	historySession := responsehistory.Start(responsehistory.StartParams{
 		Store:    h.ChatHistory,
 		Request:  r,
 		Auth:     a,
 		Surface:  "gemini.generate_content",
 		Standard: stdReq,
 	})
 	if stream {
 		h.handleGeminiDirectStream(w, r, a, stdReq, historySession)
 		return true
 	}
 	result, outErr := completionruntime.ExecuteNonStreamWithRetry(r.Context(), h.DS, a, stdReq, completionruntime.Options{
 		RetryEnabled:     true,
 		CurrentInputFile: h.Store,
 	})
 	if outErr != nil {
 		if historySession != nil {
 			historySession.ErrorTurn(outErr.Status, outErr.Message, outErr.Code, result.Turn)
 		}
 		writeGeminiError(w, outErr.Status, outErr.Message)
 		return true
 	}
 	if historySession != nil {
 		historySession.SuccessTurn(http.StatusOK, result.Turn, responsehistory.GenericUsage(result.Turn))
 	}
 	writeJSON(w, http.StatusOK, buildGeminiGenerateContentResponseFromTurn(result.Turn))
 	return true
 }
 func (h *Handler) applyCurrentInputFile(ctx context.Context, a *auth.RequestAuth, stdReq promptcompat.StandardRequest) (promptcompat.StandardRequest, error) {
 	if h == nil {
 		return stdReq, nil
 	}
 	return (history.Service{Store: h.Store, DS: h.DS}).ApplyCurrentInputFile(ctx, a, stdReq)
 }
 func mapCurrentInputFileError(err error) (int, string) {
 	return history.MapError(err)
 }
 func (h *Handler) handleGeminiDirectStream(w http.ResponseWriter, r *http.Request, a *auth.RequestAuth, stdReq promptcompat.StandardRequest, historySession *responsehistory.Session) {
 	start, outErr := completionruntime.StartCompletion(r.Context(), h.DS, a, stdReq, completionruntime.Options{
 		CurrentInputFile: h.Store,
 	})
 	if outErr != nil {
 		if historySession != nil {
 			historySession.Error(outErr.Status, outErr.Message, outErr.Code, "", "")
 		}
 		writeGeminiError(w, outErr.Status, outErr.Message)
 		return
 	}
 	streamReq := start.Request
 	h.handleStreamGenerateContent(w, r, start.Response, streamReq.ResponseModel, streamReq.PromptTokenText, streamReq.Thinking, streamReq.Search, streamReq.ToolNames, streamReq.ToolsRaw, historySession)
 }
 func (h *Handler) proxyViaOpenAI(w http.ResponseWriter, r *http.Request, stream bool) bool {
 	raw, err := io.ReadAll(r.Body)
 	if err != nil {
-		writeGeminiError(w, http.StatusBadRequest, "invalid body")
+		if errors.Is(err, requestbody.ErrInvalidUTF8Body) {
+			writeGeminiError(w, http.StatusBadRequest, "invalid json")
+		} else {
+			writeGeminiError(w, http.StatusBadRequest, "invalid body")
+		}
 		return true
 	}
 	routeModel := strings.TrimSpace(chi.URLParam(r, "model"))
@@ -214,12 +329,11 @@ func (h *Handler) handleNonStreamGenerateContent(w http.ResponseWriter, resp *ht
 	}
 	result := sse.CollectStream(resp, thinkingEnabled, true)
-	stripReferenceMarkers := h.compatStripReferenceMarkers()
 	writeJSON(w, http.StatusOK, buildGeminiGenerateContentResponse(
 		model,
 		finalPrompt,
-		cleanVisibleOutput(result.Thinking, stripReferenceMarkers),
-		cleanVisibleOutput(result.Text, stripReferenceMarkers),
+		cleanVisibleOutput(result.Thinking, false),
+		cleanVisibleOutput(result.Text, false),
 		toolNames,
 	))
 }
@@ -227,7 +341,7 @@ func (h *Handler) handleNonStreamGenerateContent(w http.ResponseWriter, resp *ht
 //nolint:unused // retained for native Gemini non-stream handling path.
 func buildGeminiGenerateContentResponse(model, finalPrompt, finalThinking, finalText string, toolNames []string) map[string]any {
 	parts := buildGeminiPartsFromFinal(finalText, finalThinking, toolNames)
-	usage := buildGeminiUsage(finalPrompt, finalThinking, finalText)
+	usage := buildGeminiUsage(model, finalPrompt, finalThinking, finalText)
 	return map[string]any{
 		"candidates": []map[string]any{
 			{
@@ -244,11 +358,65 @@ func buildGeminiGenerateContentResponse(model, finalPrompt, finalThinking, final
 	}
 }
 func buildGeminiGenerateContentResponseFromTurn(turn assistantturn.Turn) map[string]any {
 	parts := buildGeminiPartsFromTurn(turn)
 	return map[string]any{
 		"candidates": []map[string]any{
 			{
 				"index": 0,
 				"content": map[string]any{
 					"role":  "model",
 					"parts": parts,
 				},
 				"finishReason": "STOP",
 			},
 		},
 		"modelVersion": turn.Model,
 		"usageMetadata": map[string]any{
 			"promptTokenCount":     turn.Usage.InputTokens,
 			"candidatesTokenCount": turn.Usage.OutputTokens,
 			"totalTokenCount":      turn.Usage.TotalTokens,
 		},
 	}
 }
 func buildGeminiPartsFromTurn(turn assistantturn.Turn) []map[string]any {
 	thinkingPart := func() []map[string]any {
 		if turn.Thinking == "" {
 			return nil
 		}
 		return []map[string]any{{"text": turn.Thinking, "thought": true}}
 	}
 	if len(turn.ToolCalls) > 0 {
 		parts := thinkingPart()
 		if parts == nil {
 			parts = make([]map[string]any, 0, len(turn.ToolCalls))
 		}
 		for _, tc := range turn.ToolCalls {
 			parts = append(parts, map[string]any{
 				"functionCall": map[string]any{
 					"name": tc.Name,
 					"args": tc.Input,
 				},
 			})
 		}
 		return parts
 	}
 	parts := thinkingPart()
 	if turn.Text != "" {
 		parts = append(parts, map[string]any{"text": turn.Text})
 	}
 	if len(parts) == 0 {
 		parts = append(parts, map[string]any{"text": ""})
 	}
 	return parts
 }
 //nolint:unused // retained for native Gemini non-stream handling path.
-func buildGeminiUsage(finalPrompt, finalThinking, finalText string) map[string]any {
-	promptTokens := util.EstimateTokens(finalPrompt)
-	reasoningTokens := util.EstimateTokens(finalThinking)
-	completionTokens := util.EstimateTokens(finalText)
+func buildGeminiUsage(model, finalPrompt, finalThinking, finalText string) map[string]any {
+	promptTokens := util.CountPromptTokens(finalPrompt, model)
+	reasoningTokens := util.CountOutputTokens(finalThinking, model)
+	completionTokens := util.CountOutputTokens(finalText, model)
 	return map[string]any{
 		"promptTokenCount":     promptTokens,
 		"candidatesTokenCount": reasoningTokens + completionTokens,
@@ -262,8 +430,17 @@ func buildGeminiPartsFromFinal(finalText, finalThinking string, toolNames []stri
 	if len(detected) == 0 && finalThinking != "" {
 		detected = toolcall.ParseToolCalls(finalThinking, toolNames)
 	}
+	thinkingPart := func() []map[string]any {
+		if finalThinking == "" {
+			return nil
+		}
+		return []map[string]any{{"text": finalThinking, "thought": true}}
+	}
 	if len(detected) > 0 {
-		parts := make([]map[string]any, 0, len(detected))
+		parts := thinkingPart()
+		if parts == nil {
+			parts = make([]map[string]any, 0, len(detected))
+		}
 		for _, tc := range detected {
 			parts = append(parts, map[string]any{
 				"functionCall": map[string]any{
@@ -275,9 +452,12 @@ func buildGeminiPartsFromFinal(finalText, finalThinking string, toolNames []stri
 		return parts
 	}
-	text := finalText
-	if text == "" {
-		text = finalThinking
+	parts := thinkingPart()
+	if finalText != "" {
+		parts = append(parts, map[string]any{"text": finalText})
 	}
-	return []map[string]any{{"text": text}}
+	if len(parts) == 0 {
+		parts = append(parts, map[string]any{"text": ""})
+	}
+	return parts
 }
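Reviewer sketch (not part of the diff): the reworked part builders emit an optional thought part first, then the visible text, and fall back to a single empty text part when neither is present. The example below is a minimal standalone illustration of that ordering under the assumption that no tool calls were detected. `buildParts` is a hypothetical name, and the `functionCall` branch of the real helpers is omitted.

```go
package main

import "fmt"

// buildParts is a hypothetical, simplified version of the ordering used by the
// Gemini part builders in this change: thought part, then visible text, then
// an empty-text placeholder if nothing else was produced.
func buildParts(thinking, text string) []map[string]any {
	var parts []map[string]any
	if thinking != "" {
		parts = append(parts, map[string]any{"text": thinking, "thought": true})
	}
	if text != "" {
		parts = append(parts, map[string]any{"text": text})
	}
	if len(parts) == 0 {
		// Always return at least one part so the candidate content is never empty.
		parts = append(parts, map[string]any{"text": ""})
	}
	return parts
}

func main() {
	for _, p := range buildParts("think", "answer") {
		fmt.Println(p["text"], p["thought"] == true)
	}
	fmt.Println(len(buildParts("", "")))
}
```

With both inputs set, the thought part comes first and the visible text second, which is the shape `TestBuildGeminiPartsFromFinalIncludesThoughtPart` asserts for the real helper.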
--- a/internal/httpapi/gemini/handler_routes.go
+++ b/internal/httpapi/gemini/handler_routes.go
@@ -5,24 +5,24 @@ import (
 	"github.com/go-chi/chi/v5"
+	"ds2api/internal/chathistory"
+	"ds2api/internal/textclean"
 	"ds2api/internal/util"
 )
 var writeJSON = util.WriteJSON
 type Handler struct {
-	Store  ConfigReader
-	Auth   AuthResolver
-	DS     DeepSeekCaller
-	OpenAI OpenAIChatRunner
+	Store       ConfigReader
+	Auth        AuthResolver
+	DS          DeepSeekCaller
+	OpenAI      OpenAIChatRunner
+	ChatHistory *chathistory.Store
 }
 //nolint:unused // used by native Gemini stream/non-stream runtime helpers.
-func (h *Handler) compatStripReferenceMarkers() bool {
-	if h == nil || h.Store == nil {
-		return true
-	}
-	return h.Store.CompatStripReferenceMarkers()
+func stripReferenceMarkersEnabled() bool {
+	return textclean.StripReferenceMarkersEnabled()
 }
 func RegisterRoutes(r chi.Router, h *Handler) {
--- a/internal/httpapi/gemini/handler_stream_runtime.go
+++ b/internal/httpapi/gemini/handler_stream_runtime.go
@@ -7,16 +7,25 @@ import (
 	"strings"
 	"time"
 	"ds2api/internal/assistantturn"
 	dsprotocol "ds2api/internal/deepseek/protocol"
 	"ds2api/internal/responsehistory"
 	"ds2api/internal/sse"
 	streamengine "ds2api/internal/stream"
 )
 //nolint:unused // retained for native Gemini stream handling path.
-func (h *Handler) handleStreamGenerateContent(w http.ResponseWriter, r *http.Request, resp *http.Response, model, finalPrompt string, thinkingEnabled, searchEnabled bool, toolNames []string) {
+func (h *Handler) handleStreamGenerateContent(w http.ResponseWriter, r *http.Request, resp *http.Response, model, finalPrompt string, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, historySessions ...*responsehistory.Session) {
+	var historySession *responsehistory.Session
+	if len(historySessions) > 0 {
+		historySession = historySessions[0]
+	}
 	defer func() { _ = resp.Body.Close() }()
 	if resp.StatusCode != http.StatusOK {
 		body, _ := io.ReadAll(resp.Body)
+		if historySession != nil {
+			historySession.Error(resp.StatusCode, strings.TrimSpace(string(body)), "error", "", "")
+		}
 		writeGeminiError(w, resp.StatusCode, strings.TrimSpace(string(body)))
 		return
 	}
@@ -28,7 +37,7 @@ func (h *Handler) handleStreamGenerateContent(w http.ResponseWriter, r *http.Req
 	rc := http.NewResponseController(w)
 	_, canFlush := w.(http.Flusher)
-	runtime := newGeminiStreamRuntime(w, rc, canFlush, model, finalPrompt, thinkingEnabled, searchEnabled, h.compatStripReferenceMarkers(), toolNames)
+	runtime := newGeminiStreamRuntime(w, rc, canFlush, model, finalPrompt, thinkingEnabled, searchEnabled, stripReferenceMarkersEnabled(), toolNames, toolsRaw, historySession)
 	initialType := "text"
 	if thinkingEnabled {
@@ -64,9 +73,12 @@ type geminiStreamRuntime struct {
 	bufferContent         bool
 	stripReferenceMarkers bool
 	toolNames             []string
+	toolsRaw              any
-	thinking strings.Builder
-	text     strings.Builder
+	accumulator       *assistantturn.Accumulator
+	contentFilter     bool
+	responseMessageID int
+	history           *responsehistory.Session
 }
 //nolint:unused // retained for native Gemini stream handling path.
@@ -80,6 +92,8 @@ func newGeminiStreamRuntime(
 	searchEnabled bool,
 	stripReferenceMarkers bool,
 	toolNames []string,
+	toolsRaw any,
+	history *responsehistory.Session,
 ) *geminiStreamRuntime {
 	return &geminiStreamRuntime{
 		w:                     w,
@@ -92,6 +106,13 @@ func newGeminiStreamRuntime(
 		bufferContent:         len(toolNames) > 0,
 		stripReferenceMarkers: stripReferenceMarkers,
 		toolNames:             toolNames,
+		toolsRaw:              toolsRaw,
+		history:               history,
+		accumulator: assistantturn.NewAccumulator(assistantturn.AccumulatorOptions{
+			ThinkingEnabled:       thinkingEnabled,
+			SearchEnabled:         searchEnabled,
+			StripReferenceMarkers: stripReferenceMarkers,
+		}),
 	}
 }
@@ -111,35 +132,39 @@ func (s *geminiStreamRuntime) onParsed(parsed sse.LineResult) streamengine.Parse
 	if !parsed.Parsed {
 		return streamengine.ParsedDecision{}
 	}
+	if parsed.ResponseMessageID > 0 {
+		s.responseMessageID = parsed.ResponseMessageID
+	}
 	if parsed.ContentFilter || parsed.ErrorMessage != "" || parsed.Stop {
+		if parsed.ContentFilter {
+			s.contentFilter = true
+		}
 		return streamengine.ParsedDecision{Stop: true}
 	}
-	contentSeen := false
-	for _, p := range parsed.Parts {
-		cleanedText := cleanVisibleOutput(p.Text, s.stripReferenceMarkers)
-		if cleanedText == "" {
-			continue
-		}
-		if p.Type != "thinking" && s.searchEnabled && sse.IsCitation(cleanedText) {
-			continue
-		}
-		contentSeen = true
+	accumulated := s.accumulator.Apply(parsed)
+	for _, p := range accumulated.Parts {
 		if p.Type == "thinking" {
-			if s.thinkingEnabled {
-				trimmed := sse.TrimContinuationOverlap(s.thinking.String(), cleanedText)
-				if trimmed == "" {
-					continue
-				}
-				s.thinking.WriteString(trimmed)
+			if p.VisibleText == "" || s.bufferContent {
+				continue
 			}
+			s.sendChunk(map[string]any{
+				"candidates": []map[string]any{
+					{
+						"index": 0,
+						"content": map[string]any{
+							"role":  "model",
+							"parts": []map[string]any{{"text": p.VisibleText, "thought": true}},
+						},
+					},
+				},
+				"modelVersion": s.model,
+			})
 			continue
 		}
-		trimmed := sse.TrimContinuationOverlap(s.text.String(), cleanedText)
-		if trimmed == "" {
+		if p.RawText == "" || p.CitationOnly || p.VisibleText == "" {
 			continue
 		}
-		s.text.WriteString(trimmed)
 		if s.bufferContent {
 			continue
 		}
@@ -149,23 +174,55 @@ func (s *geminiStreamRuntime) onParsed(parsed sse.LineResult) streamengine.Parse
 					"index": 0,
 					"content": map[string]any{
 						"role":  "model",
-						"parts": []map[string]any{{"text": trimmed}},
+						"parts": []map[string]any{{"text": p.VisibleText}},
 					},
 				},
 			},
 			"modelVersion": s.model,
 		})
 	}
-	return streamengine.ParsedDecision{ContentSeen: contentSeen}
+	if s.history != nil {
+		rawText, text, rawThinking, thinking, detectionThinking := s.accumulator.Snapshot()
+		s.history.Progress(
+			responsehistory.ThinkingForArchive(rawThinking, detectionThinking, thinking),
+			responsehistory.TextForArchive(rawText, text),
+		)
+	}
+	return streamengine.ParsedDecision{ContentSeen: accumulated.ContentSeen}
 }
 //nolint:unused // retained for native Gemini stream handling path.
 func (s *geminiStreamRuntime) finalize() {
-	finalThinking := s.thinking.String()
-	finalText := cleanVisibleOutput(s.text.String(), s.stripReferenceMarkers)
+	rawText, text, rawThinking, thinking, detectionThinking := s.accumulator.Snapshot()
+	turn := assistantturn.BuildTurnFromStreamSnapshot(assistantturn.StreamSnapshot{
+		RawText:           rawText,
+		VisibleText:       text,
+		RawThinking:       rawThinking,
+		VisibleThinking:   thinking,
+		DetectionThinking: detectionThinking,
+		ContentFilter:     s.contentFilter,
+		ResponseMessageID: s.responseMessageID,
+	}, assistantturn.BuildOptions{
+		Model:                 s.model,
+		Prompt:                s.finalPrompt,
+		SearchEnabled:         s.searchEnabled,
+		StripReferenceMarkers: s.stripReferenceMarkers,
+		ToolNames:             s.toolNames,
+		ToolsRaw:              s.toolsRaw,
+	})
+	outcome := assistantturn.FinalizeTurn(turn, assistantturn.FinalizeOptions{})
+	if s.history != nil {
+		s.history.Success(
+			http.StatusOK,
+			responsehistory.ThinkingForArchive(turn.RawThinking, turn.DetectionThinking, turn.Thinking),
+			responsehistory.TextForArchive(turn.RawText, turn.Text),
+			assistantturn.FinishReason(turn),
+			responsehistory.GenericUsage(turn),
+		)
+	}
 	if s.bufferContent {
-		parts := buildGeminiPartsFromFinal(finalText, finalThinking, s.toolNames)
+		parts := buildGeminiPartsFromTurn(turn)
 		s.sendChunk(map[string]any{
 			"candidates": []map[string]any{
 				{
@@ -193,7 +250,11 @@ func (s *geminiStreamRuntime) finalize() {
 				"finishReason": "STOP",
 			},
 		},
-		"modelVersion":  s.model,
-		"usageMetadata": buildGeminiUsage(s.finalPrompt, finalThinking, finalText),
+		"modelVersion": s.model,
+		"usageMetadata": map[string]any{
+			"promptTokenCount":     outcome.Usage.InputTokens,
+			"candidatesTokenCount": outcome.Usage.OutputTokens,
+			"totalTokenCount":      outcome.Usage.TotalTokens,
+		},
 	})
 }
--- a/internal/httpapi/gemini/handler_test.go
+++ b/internal/httpapi/gemini/handler_test.go
@@ -7,18 +7,22 @@ import (
 	"io"
 	"net/http"
 	"net/http/httptest"
 	"path/filepath"
 	"strings"
 	"testing"
 	"github.com/go-chi/chi/v5"
 	"ds2api/internal/auth"
 	"ds2api/internal/chathistory"
 	dsclient "ds2api/internal/deepseek/client"
 )
 type testGeminiConfig struct{}
-func (testGeminiConfig) ModelAliases() map[string]string   { return nil }
-func (testGeminiConfig) CompatStripReferenceMarkers() bool { return true }
+func (testGeminiConfig) ModelAliases() map[string]string { return nil }
+func (testGeminiConfig) CurrentInputFileEnabled() bool   { return true }
+func (testGeminiConfig) CurrentInputFileMinChars() int   { return 0 }
 type testGeminiAuth struct {
 	a   *auth.RequestAuth
@@ -44,22 +48,31 @@ func (testGeminiAuth) Release(_ *auth.RequestAuth) {}
 //nolint:unused // reserved test double for native Gemini DS-call path coverage.
 type testGeminiDS struct {
-	resp *http.Response
-	err  error
+	resp        *http.Response
+	err         error
+	uploadCalls []dsclient.UploadFileRequest
+	payloads    []map[string]any
 }
 //nolint:unused // reserved test double for native Gemini DS-call path coverage.
-func (m testGeminiDS) CreateSession(_ context.Context, _ *auth.RequestAuth, _ int) (string, error) {
+func (m *testGeminiDS) CreateSession(_ context.Context, _ *auth.RequestAuth, _ int) (string, error) {
 	return "session-id", nil
 }
 //nolint:unused // reserved test double for native Gemini DS-call path coverage.
-func (m testGeminiDS) GetPow(_ context.Context, _ *auth.RequestAuth, _ int) (string, error) {
+func (m *testGeminiDS) GetPow(_ context.Context, _ *auth.RequestAuth, _ int) (string, error) {
 	return "pow", nil
 }
 //nolint:unused // reserved test double for native Gemini DS-call path coverage.
-func (m testGeminiDS) CallCompletion(_ context.Context, _ *auth.RequestAuth, _ map[string]any, _ string, _ int) (*http.Response, error) {
+func (m *testGeminiDS) UploadFile(_ context.Context, _ *auth.RequestAuth, req dsclient.UploadFileRequest, _ int) (*dsclient.UploadFileResult, error) {
+	m.uploadCalls = append(m.uploadCalls, req)
+	return &dsclient.UploadFileResult{ID: "file-gemini-history"}, nil
+}
+//nolint:unused // reserved test double for native Gemini DS-call path coverage.
+func (m *testGeminiDS) CallCompletion(_ context.Context, _ *auth.RequestAuth, payload map[string]any, _ string, _ int) (*http.Response, error) {
+	m.payloads = append(m.payloads, payload)
 	if m.err != nil {
 		return nil, m.err
 	}
@@ -123,6 +136,71 @@ func makeGeminiUpstreamResponse(lines ...string) *http.Response {
 	}
 }
 func TestGeminiDirectAppliesCurrentInputFile(t *testing.T) {
 	ds := &testGeminiDS{
 		resp: makeGeminiUpstreamResponse(`data: {"p":"response/content","v":"ok"}`),
 	}
 	historyStore := chathistory.New(filepath.Join(t.TempDir(), "history.json"))
 	h := &Handler{
 		Store:       testGeminiConfig{},
 		Auth:        testGeminiAuth{},
 		DS:          ds,
 		ChatHistory: historyStore,
 	}
 	reqBody := `{"contents":[{"role":"user","parts":[{"text":"hello from gemini"}]}]}`
 	req := httptest.NewRequest(http.MethodPost, "/v1beta/models/gemini-2.5-pro:generateContent", strings.NewReader(reqBody))
 	req.Header.Set("Content-Type", "application/json")
 	rec := httptest.NewRecorder()
 	r := chi.NewRouter()
 	RegisterRoutes(r, h)
 	r.ServeHTTP(rec, req)
 	if rec.Code != http.StatusOK {
 		t.Fatalf("expected 200, got %d body=%s", rec.Code, rec.Body.String())
 	}
 	if len(ds.uploadCalls) != 1 {
 		t.Fatalf("expected one current input upload, got %d", len(ds.uploadCalls))
 	}
 	if ds.uploadCalls[0].Filename != "DS2API_HISTORY.txt" {
 		t.Fatalf("unexpected upload filename: %q", ds.uploadCalls[0].Filename)
 	}
 	if len(ds.payloads) != 1 {
 		t.Fatalf("expected one completion payload, got %d", len(ds.payloads))
 	}
 	refIDs, _ := ds.payloads[0]["ref_file_ids"].([]any)
 	if len(refIDs) != 1 || refIDs[0] != "file-gemini-history" {
 		t.Fatalf("expected uploaded history ref id, got %#v", ds.payloads[0]["ref_file_ids"])
 	}
 	prompt, _ := ds.payloads[0]["prompt"].(string)
 	if !strings.Contains(prompt, "Continue from the latest state in the attached DS2API_HISTORY.txt context.") {
 		t.Fatalf("expected continuation prompt, got %q", prompt)
 	}
 	snapshot, err := historyStore.Snapshot()
 	if err != nil {
 		t.Fatalf("snapshot history: %v", err)
 	}
 	if len(snapshot.Items) != 1 {
 		t.Fatalf("expected one history item, got %d", len(snapshot.Items))
 	}
 	full, err := historyStore.Get(snapshot.Items[0].ID)
 	if err != nil {
 		t.Fatalf("get history item: %v", err)
 	}
 	if full.Surface != "gemini.generate_content" {
 		t.Fatalf("unexpected surface: %q", full.Surface)
 	}
 	if full.Content != "ok" {
 		t.Fatalf("expected raw upstream content, got %q", full.Content)
 	}
 	if full.HistoryText != string(ds.uploadCalls[0].Data) {
 		t.Fatalf("expected uploaded current input file to be persisted in history text")
 	}
 	if len(full.Messages) != 1 || !strings.Contains(full.Messages[0].Content, "Continue from the latest state in the attached DS2API_HISTORY.txt context.") {
 		t.Fatalf("expected persisted message to match upstream continuation prompt, got %#v", full.Messages)
 	}
 }
 func TestGeminiRoutesRegistered(t *testing.T) {
 	h := &Handler{
 		Store: testGeminiConfig{},
@@ -257,6 +335,56 @@ func TestStreamGenerateContentEmitsSSE(t *testing.T) {
 	}
 }
 func TestNativeStreamGenerateContentEmitsThoughtParts(t *testing.T) {
 	h := &Handler{}
 	resp := makeGeminiUpstreamResponse(
 		`data: {"p":"response/thinking_content","v":"think"}`,
 		`data: {"p":"response/content","v":"answer"}`,
 		`data: [DONE]`,
 	)
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/v1beta/models/gemini-2.5-pro:streamGenerateContent", nil)
 	h.handleStreamGenerateContent(rec, req, resp, "gemini-2.5-pro", "prompt", true, false, nil, nil)
 	frames := extractGeminiSSEFrames(t, rec.Body.String())
 	if len(frames) < 2 {
 		t.Fatalf("expected thought and text stream frames, body=%s", rec.Body.String())
 	}
 	var gotThought, gotText string
 	for _, frame := range frames {
 		for _, part := range geminiPartsFromFrame(frame) {
 			if part["thought"] == true {
 				gotThought += asString(part["text"])
 			} else {
 				gotText += asString(part["text"])
 			}
 		}
 	}
 	if gotThought != "think" {
 		t.Fatalf("expected thought part, got %q body=%s", gotThought, rec.Body.String())
 	}
 	if !strings.Contains(gotText, "answer") {
 		t.Fatalf("expected text part answer, got %q body=%s", gotText, rec.Body.String())
 	}
 }
 func TestBuildGeminiPartsFromFinalIncludesThoughtPart(t *testing.T) {
 	parts := buildGeminiPartsFromFinal("answer", "think", nil)
 	if len(parts) != 2 {
 		t.Fatalf("expected thought + answer parts, got %#v", parts)
 	}
 	if parts[0]["thought"] != true || parts[0]["text"] != "think" {
 		t.Fatalf("expected first part to be thought, got %#v", parts[0])
 	}
 	if _, ok := parts[1]["thought"]; ok {
 		t.Fatalf("expected second part to be visible text, got %#v", parts[1])
 	}
 	if parts[1]["text"] != "answer" {
 		t.Fatalf("expected answer text, got %#v", parts[1])
 	}
 }
 func TestGeminiProxyTranslatesInlineImageToOpenAIDataURL(t *testing.T) {
 	openAI := &geminiOpenAISuccessStub{}
 	h := &Handler{Store: testGeminiConfig{}, OpenAI: openAI}
@@ -396,3 +524,21 @@ func extractGeminiSSEFrames(t *testing.T, body string) []map[string]any {
 	}
 	return out
 }
 func geminiPartsFromFrame(frame map[string]any) []map[string]any {
 	candidates, _ := frame["candidates"].([]any)
 	if len(candidates) == 0 {
 		return nil
 	}
 	c0, _ := candidates[0].(map[string]any)
 	content, _ := c0["content"].(map[string]any)
 	rawParts, _ := content["parts"].([]any)
 	parts := make([]map[string]any, 0, len(rawParts))
 	for _, raw := range rawParts {
 		part, _ := raw.(map[string]any)
 		if part != nil {
 			parts = append(parts, part)
 		}
 	}
 	return parts
 }
--- a/internal/httpapi/openai/chat/chat_history.go
+++ b/internal/httpapi/openai/chat/chat_history.go
@@ -14,9 +14,6 @@ import (
 	"ds2api/internal/promptcompat"
 )
 const adminWebUISourceHeader = "X-Ds2-Source"
 const adminWebUISourceValue = "admin-webui-api-tester"
 type chatHistorySession struct {
 	store       *chathistory.Store
 	entryID     string
@@ -40,6 +37,7 @@ func startChatHistory(store *chathistory.Store, r *http.Request, a *auth.Request
 	entry, err := store.Start(chathistory.StartParams{
 		CallerID:    strings.TrimSpace(a.CallerID),
 		AccountID:   strings.TrimSpace(a.AccountID),
 		Surface:     "openai.chat_completions",
 		Model:       strings.TrimSpace(stdReq.ResponseModel),
 		Stream:      stdReq.Stream,
 		UserInput:   extractSingleUserInput(stdReq.Messages),
@@ -50,6 +48,7 @@ func startChatHistory(store *chathistory.Store, r *http.Request, a *auth.Request
 	startParams := chathistory.StartParams{
 		CallerID:    strings.TrimSpace(a.CallerID),
 		AccountID:   strings.TrimSpace(a.AccountID),
 		Surface:     "openai.chat_completions",
 		Model:       strings.TrimSpace(stdReq.ResponseModel),
 		Stream:      stdReq.Stream,
 		UserInput:   extractSingleUserInput(stdReq.Messages),
@@ -82,7 +81,7 @@ func shouldCaptureChatHistory(r *http.Request) bool {
 	if isVercelStreamPrepareRequest(r) || isVercelStreamReleaseRequest(r) {
 		return false
 	}
-	return strings.TrimSpace(r.Header.Get(adminWebUISourceHeader)) != adminWebUISourceValue
+	return true
 }
 func extractSingleUserInput(messages []any) string {
@@ -188,6 +187,23 @@ func (s *chatHistorySession) stopped(thinking, content, finishReason string) {
 	})
 }
+func historyTextForArchive(raw, visible string) string {
+	if strings.TrimSpace(raw) != "" {
+		return raw
+	}
+	return visible
+}
+func historyThinkingForArchive(raw, detection, visible string) string {
+	if strings.TrimSpace(raw) != "" {
+		return raw
+	}
+	if strings.TrimSpace(detection) != "" {
+		return detection
+	}
+	return visible
+}
 func (s *chatHistorySession) retryMissingEntry() bool {
 	if s == nil || s.store == nil || s.disabled {
 		return false
--- a/internal/httpapi/openai/chat/chat_history_test.go
+++ b/internal/httpapi/openai/chat/chat_history_test.go
@@ -6,6 +6,7 @@ import (
 	"net/http/httptest"
 	"os"
 	"path/filepath"
+	"strconv"
 	"strings"
 	"sync"
 	"testing"
@@ -57,7 +58,7 @@ func blockChatHistoryDetailDir(t *testing.T, detailDir string) func() {
 func TestChatCompletionsNonStreamPersistsHistory(t *testing.T) {
 	historyStore := newTestChatHistoryStore(t)
 	h := &Handler{
-		Store:       mockOpenAIConfig{wideInput: true},
+		Store:       mockOpenAIConfig{},
 		Auth:        streamStatusAuthStub{},
 		DS:          streamStatusDSStub{resp: makeOpenAISSEHTTPResponse(`data: {"p":"response/content","v":"hello world"}`, `data: [DONE]`)},
 		ChatHistory: historyStore,
@@ -102,6 +103,86 @@ func TestChatCompletionsNonStreamPersistsHistory(t *testing.T) {
 	}
 }
 func TestChatHistoryNonStreamArchivesRawToolCallMarkup(t *testing.T) {
 	historyStore := newTestChatHistoryStore(t)
 	entry, err := historyStore.Start(chathistory.StartParams{
 		CallerID:  "caller:test",
 		Model:     "deepseek-v4-flash",
 		UserInput: "call tool",
 	})
 	if err != nil {
 		t.Fatalf("start history failed: %v", err)
 	}
 	session := &chatHistorySession{
 		store:       historyStore,
 		entryID:     entry.ID,
 		startedAt:   time.Now(),
 		lastPersist: time.Now().Add(-time.Second),
 		finalPrompt: "call tool",
 	}
 	rawToolCall := `<tool_calls><invoke name="search"><parameter name="q">golang</parameter></invoke></tool_calls>`
 	h := &Handler{}
 	rec := httptest.NewRecorder()
 	resp := makeOpenAISSEHTTPResponse(`data: {"p":"response/content","v":`+strconv.Quote(rawToolCall)+`}`, `data: [DONE]`)
 	h.handleNonStream(rec, resp, "cid-tool-history", "deepseek-v4-flash", "prompt", 0, false, false, []string{"search"}, nil, session)
 	if rec.Code != http.StatusOK {
 		t.Fatalf("expected 200, got %d body=%s", rec.Code, rec.Body.String())
 	}
 	full, err := historyStore.Get(entry.ID)
 	if err != nil {
 		t.Fatalf("get detail failed: %v", err)
 	}
 	if full.Content != rawToolCall {
 		t.Fatalf("expected raw tool markup archived, got %q", full.Content)
 	}
 	if full.FinishReason != "tool_calls" {
 		t.Fatalf("expected tool_calls finish reason, got %#v", full.FinishReason)
 	}
 }
 func TestChatHistoryStreamArchivesRawToolCallMarkup(t *testing.T) {
 	historyStore := newTestChatHistoryStore(t)
 	entry, err := historyStore.Start(chathistory.StartParams{
 		CallerID:  "caller:test",
 		Model:     "deepseek-v4-flash",
 		Stream:    true,
 		UserInput: "call tool",
 	})
 	if err != nil {
 		t.Fatalf("start history failed: %v", err)
 	}
 	session := &chatHistorySession{
 		store:       historyStore,
 		entryID:     entry.ID,
 		startedAt:   time.Now(),
 		lastPersist: time.Now().Add(-time.Second),
 		finalPrompt: "call tool",
 	}
 	rawToolCall := `<tool_calls><invoke name="search"><parameter name="q">golang</parameter></invoke></tool_calls>`
 	h := &Handler{}
 	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil)
 	rec := httptest.NewRecorder()
 	resp := makeOpenAISSEHTTPResponse(`data: {"p":"response/content","v":`+strconv.Quote(rawToolCall)+`}`, `data: [DONE]`)
 	h.handleStream(rec, req, resp, "cid-stream-tool-history", "deepseek-v4-flash", "prompt", 0, false, false, []string{"search"}, nil, session)
 	if rec.Code != http.StatusOK {
 		t.Fatalf("expected 200, got %d body=%s", rec.Code, rec.Body.String())
 	}
 	full, err := historyStore.Get(entry.ID)
 	if err != nil {
 		t.Fatalf("get detail failed: %v", err)
 	}
 	if full.Content != rawToolCall {
 		t.Fatalf("expected raw streamed tool markup archived, got %q", full.Content)
 	}
 	if full.FinishReason != "tool_calls" {
 		t.Fatalf("expected tool_calls finish reason, got %#v", full.FinishReason)
 	}
 }
 func TestStartChatHistoryRecoversFromTransientWriteFailure(t *testing.T) {
 	historyStore := newTestChatHistoryStore(t)
 	restore := blockChatHistoryDetailDir(t, historyStore.DetailDir())
@@ -126,6 +207,7 @@ func TestStartChatHistoryRecoversFromTransientWriteFailure(t *testing.T) {
 	session := startChatHistory(historyStore, req, a, stdReq)
 	if session == nil {
 		t.Fatalf("expected session even when initial persistence fails")
+		return
 	}
 	if session.disabled {
 		t.Fatalf("expected session to remain active after transient start failure")
@@ -194,7 +276,7 @@ func TestHandleStreamContextCancelledMarksHistoryStopped(t *testing.T) {
 	rec := httptest.NewRecorder()
 	resp := makeOpenAISSEHTTPResponse(`data: {"p":"response/content","v":"hello"}`, `data: [DONE]`)
-	h.handleStream(rec, req, resp, "cid-stop", "deepseek-v4-flash", "prompt", false, false, nil, nil, session)
+	h.handleStream(rec, req, resp, "cid-stop", "deepseek-v4-flash", "prompt", 0, false, false, nil, nil, session)
 	snapshot, err := historyStore.Snapshot()
 	if err != nil {
@@ -212,10 +294,10 @@ func TestHandleStreamContextCancelledMarksHistoryStopped(t *testing.T) {
 	}
 }
-func TestChatCompletionsSkipsAdminWebUISource(t *testing.T) {
+func TestChatCompletionsRecordsAdminWebUISource(t *testing.T) {
 	historyStore := newTestChatHistoryStore(t)
 	h := &Handler{
-		Store:       mockOpenAIConfig{wideInput: true},
+		Store:       mockOpenAIConfig{},
 		Auth:        streamStatusAuthStub{},
 		DS:          streamStatusDSStub{resp: makeOpenAISSEHTTPResponse(`data: {"p":"response/content","v":"hello world"}`, `data: [DONE]`)},
 		ChatHistory: historyStore,
@@ -225,7 +307,7 @@ func TestChatCompletionsSkipsAdminWebUISource(t *testing.T) {
 	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", strings.NewReader(reqBody))
 	req.Header.Set("Authorization", "Bearer direct-token")
 	req.Header.Set("Content-Type", "application/json")
-	req.Header.Set(adminWebUISourceHeader, adminWebUISourceValue)
+	req.Header.Set("X-Ds2-Source", "admin-webui-api-tester")
 	rec := httptest.NewRecorder()
 	h.ChatCompletions(rec, req)
@@ -236,8 +318,8 @@ func TestChatCompletionsSkipsAdminWebUISource(t *testing.T) {
 	if err != nil {
 		t.Fatalf("snapshot failed: %v", err)
 	}
-	if len(snapshot.Items) != 0 {
+	if len(snapshot.Items) != 1 {
-		t.Fatalf("expected admin webui source to be skipped, got %#v", snapshot.Items)
+		t.Fatalf("expected admin webui source to be recorded, got %#v", snapshot.Items)
 	}
 }
@@ -247,7 +329,7 @@ func TestChatCompletionsSkipsHistoryWhenDisabled(t *testing.T) {
 		t.Fatalf("disable history store failed: %v", err)
 	}
 	h := &Handler{
-		Store:       mockOpenAIConfig{wideInput: true},
+		Store:       mockOpenAIConfig{},
 		Auth:        streamStatusAuthStub{},
 		DS:          streamStatusDSStub{resp: makeOpenAISSEHTTPResponse(`data: {"p":"response/content","v":"hello world"}`, `data: [DONE]`)},
 		ChatHistory: historyStore,
@@ -277,7 +359,6 @@ func TestChatCompletionsCurrentInputFilePersistsNeutralPrompt(t *testing.T) {
 	ds := &inlineUploadDSStub{}
 	h := &Handler{
 		Store: mockOpenAIConfig{
-			wideInput:           true,
 			currentInputEnabled: true,
 		},
 		Auth:        streamStatusAuthStub{},
@@ -310,16 +391,16 @@ func TestChatCompletionsCurrentInputFilePersistsNeutralPrompt(t *testing.T) {
 	if len(ds.uploadCalls) != 1 {
 		t.Fatalf("expected current input upload to happen, got %d", len(ds.uploadCalls))
 	}
-	if ds.uploadCalls[0].Filename != "IGNORE.txt" {
+	if ds.uploadCalls[0].Filename != "DS2API_HISTORY.txt" {
-		t.Fatalf("expected IGNORE.txt upload, got %q", ds.uploadCalls[0].Filename)
+		t.Fatalf("expected DS2API_HISTORY.txt upload, got %q", ds.uploadCalls[0].Filename)
 	}
 	if full.HistoryText != string(ds.uploadCalls[0].Data) {
 		t.Fatalf("expected uploaded current input file to be persisted in history text")
 	}
 	if len(full.Messages) != 1 {
-		t.Fatalf("expected neutral prompt to be the only persisted message, got %#v", full.Messages)
+		t.Fatalf("expected continuation prompt to be the only persisted message, got %#v", full.Messages)
 	}
-	if !strings.Contains(full.Messages[0].Content, "Answer the latest user request directly.") {
+	if !strings.Contains(full.Messages[0].Content, "Continue from the latest state in the attached DS2API_HISTORY.txt context.") {
-		t.Fatalf("expected neutral prompt to be persisted, got %#v", full.Messages[0])
+		t.Fatalf("expected continuation prompt to be persisted, got %#v", full.Messages[0])
 	}
 }
--- a/internal/httpapi/openai/chat/chat_stream_runtime.go
+++ b/internal/httpapi/openai/chat/chat_stream_runtime.go
@@ -5,7 +5,10 @@ import (
 	"net/http"
 	"strings"
+	"ds2api/internal/assistantturn"
 	openaifmt "ds2api/internal/format/openai"
+	"ds2api/internal/httpapi/openai/shared"
+	"ds2api/internal/promptcompat"
 	"ds2api/internal/sse"
 	streamengine "ds2api/internal/stream"
 	"ds2api/internal/toolstream"
@@ -16,12 +19,14 @@ type chatStreamRuntime struct {
 	rc       *http.ResponseController
 	canFlush bool
-	completionID string
-	created      int64
-	model        string
-	finalPrompt  string
-	toolNames    []string
-	toolsRaw     any
+	completionID  string
+	created       int64
+	model         string
+	finalPrompt   string
+	refFileTokens int
+	toolNames     []string
+	toolsRaw      any
+	toolChoice    promptcompat.ToolChoicePolicy
 	thinkingEnabled       bool
 	searchEnabled         bool
@@ -33,13 +38,11 @@ type chatStreamRuntime struct {
 	toolCallsEmitted     bool
 	toolCallsDoneEmitted bool
-	toolSieve             toolstream.State
-	streamToolCallIDs     map[int]string
-	streamToolNames       map[int]string
-	thinking              strings.Builder
-	toolDetectionThinking strings.Builder
-	text                  strings.Builder
-	responseMessageID     int
+	toolSieve         toolstream.State
+	streamToolCallIDs map[int]string
+	streamToolNames   map[int]string
+	accumulator       shared.StreamAccumulator
+	responseMessageID int
 	finalThinking     string
 	finalText         string
@@ -50,6 +53,32 @@ type chatStreamRuntime struct {
 	finalErrorCode    string
 }
+type chatDeltaBatch struct {
+	runtime *chatStreamRuntime
+	field   string
+	text    strings.Builder
+}
+func (b *chatDeltaBatch) append(field, text string) {
+	if text == "" {
+		return
+	}
+	if b.field != "" && b.field != field {
+		b.flush()
+	}
+	b.field = field
+	b.text.WriteString(text)
+}
+func (b *chatDeltaBatch) flush() {
+	if b.field == "" || b.text.Len() == 0 {
+		return
+	}
+	b.runtime.sendDelta(map[string]any{b.field: b.text.String()})
+	b.field = ""
+	b.text.Reset()
+}
 func newChatStreamRuntime(
 	w http.ResponseWriter,
 	rc *http.ResponseController,
@@ -63,6 +92,7 @@ func newChatStreamRuntime(
 	stripReferenceMarkers bool,
 	toolNames []string,
 	toolsRaw any,
+	toolChoice promptcompat.ToolChoicePolicy,
 	bufferToolContent bool,
 	emitEarlyToolDeltas bool,
 ) *chatStreamRuntime {
@@ -76,6 +106,7 @@ func newChatStreamRuntime(
 		finalPrompt:           finalPrompt,
 		toolNames:             toolNames,
 		toolsRaw:              toolsRaw,
+		toolChoice:            toolChoice,
 		thinkingEnabled:       thinkingEnabled,
 		searchEnabled:         searchEnabled,
 		stripReferenceMarkers: stripReferenceMarkers,
@@ -83,6 +114,11 @@ func newChatStreamRuntime(
 		emitEarlyToolDeltas:   emitEarlyToolDeltas,
 		streamToolCallIDs:     map[int]string{},
 		streamToolNames:       map[int]string{},
+		accumulator: shared.StreamAccumulator{
+			ThinkingEnabled:       thinkingEnabled,
+			SearchEnabled:         searchEnabled,
+			StripReferenceMarkers: stripReferenceMarkers,
+		},
 	}
 }
@@ -91,7 +127,13 @@ func (s *chatStreamRuntime) sendKeepAlive() {
 		return
 	}
 	_, _ = s.w.Write([]byte(": keep-alive\n\n"))
-	_ = s.rc.Flush()
+	s.sendChunk(openaifmt.BuildChatStreamChunk(
+		s.completionID,
+		s.created,
+		s.model,
+		[]map[string]any{},
+		nil,
+	))
 }
 func (s *chatStreamRuntime) sendChunk(v any) {
@@ -104,6 +146,23 @@ func (s *chatStreamRuntime) sendChunk(v any) {
 	}
 }
+func (s *chatStreamRuntime) sendDelta(delta map[string]any) {
+	if len(delta) == 0 {
+		return
+	}
+	if !s.firstChunkSent {
+		delta["role"] = "assistant"
+		s.firstChunkSent = true
+	}
+	s.sendChunk(openaifmt.BuildChatStreamChunk(
+		s.completionID,
+		s.created,
+		s.model,
+		[]map[string]any{openaifmt.BuildChatStreamDeltaChoice(0, delta)},
+		nil,
+	))
+}
 func (s *chatStreamRuntime) sendDone() {
 	_, _ = s.w.Write([]byte("data: [DONE]\n\n"))
 	if s.canFlush {
@@ -127,6 +186,33 @@ func (s *chatStreamRuntime) sendFailedChunk(status int, message, code string) {
 	s.sendDone()
 }
+func (s *chatStreamRuntime) markContextCancelled() {
+	s.finalErrorStatus = 499
+	s.finalErrorMessage = "request context cancelled"
+	s.finalErrorCode = string(streamengine.StopReasonContextCancelled)
+	s.finalThinking = s.accumulator.Thinking.String()
+	s.finalText = cleanVisibleOutput(s.accumulator.Text.String(), s.stripReferenceMarkers)
+	s.finalFinishReason = string(streamengine.StopReasonContextCancelled)
+}
+func (s *chatStreamRuntime) historyText() string {
+	if s == nil {
+		return ""
+	}
+	return historyTextForArchive(s.accumulator.RawText.String(), s.finalText)
+}
+func (s *chatStreamRuntime) historyThinking() string {
+	if s == nil {
+		return ""
+	}
+	return historyThinkingForArchive(
+		s.accumulator.RawThinking.String(),
+		s.accumulator.ToolDetectionThinking.String(),
+		s.finalThinking,
+	)
+}
 func (s *chatStreamRuntime) resetStreamToolCallState() {
 	s.streamToolCallIDs = map[int]string{}
 	s.streamToolNames = map[int]string{}
@@ -136,81 +222,66 @@ func (s *chatStreamRuntime) finalize(finishReason string, deferEmptyOutput bool)
 	s.finalErrorStatus = 0
 	s.finalErrorMessage = ""
 	s.finalErrorCode = ""
-	finalThinking := s.thinking.String()
-	finalToolDetectionThinking := s.toolDetectionThinking.String()
-	finalText := cleanVisibleOutput(s.text.String(), s.stripReferenceMarkers)
-	s.finalThinking = finalThinking
-	s.finalText = finalText
-	detected := detectAssistantToolCalls(finalText, finalThinking, finalToolDetectionThinking, s.toolNames)
-	if len(detected.Calls) > 0 && !s.toolCallsDoneEmitted {
-		finishReason = "tool_calls"
-		delta := map[string]any{
-			"tool_calls": formatFinalStreamToolCallsWithStableIDs(detected.Calls, s.streamToolCallIDs, s.toolsRaw),
-		}
-		if !s.firstChunkSent {
-			delta["role"] = "assistant"
-			s.firstChunkSent = true
-		}
-		s.sendChunk(openaifmt.BuildChatStreamChunk(
-			s.completionID,
-			s.created,
-			s.model,
-			[]map[string]any{openaifmt.BuildChatStreamDeltaChoice(0, delta)},
-			nil,
-		))
+	finalThinking := s.accumulator.Thinking.String()
+	finalToolDetectionThinking := s.accumulator.ToolDetectionThinking.String()
+	finalText := s.accumulator.Text.String()
+	turn := assistantturn.BuildTurnFromStreamSnapshot(assistantturn.StreamSnapshot{
+		RawText:               s.accumulator.RawText.String(),
+		VisibleText:           finalText,
+		RawThinking:           s.accumulator.RawThinking.String(),
+		VisibleThinking:       finalThinking,
+		DetectionThinking:     finalToolDetectionThinking,
+		ContentFilter:         finishReason == "content_filter",
+		ResponseMessageID:     s.responseMessageID,
+		AlreadyEmittedCalls:   s.toolCallsEmitted,
+		AlreadyEmittedToolRaw: s.toolCallsDoneEmitted,
+	}, assistantturn.BuildOptions{
+		Model:                 s.model,
+		Prompt:                s.finalPrompt,
+		RefFileTokens:         s.refFileTokens,
+		SearchEnabled:         s.searchEnabled,
+		StripReferenceMarkers: s.stripReferenceMarkers,
+		ToolNames:             s.toolNames,
+		ToolsRaw:              s.toolsRaw,
+		ToolChoice:            s.toolChoice,
+	})
+	s.finalThinking = turn.Thinking
+	s.finalText = turn.Text
+	if len(turn.ToolCalls) > 0 && !s.toolCallsDoneEmitted {
+		s.sendDelta(map[string]any{
+			"tool_calls": formatFinalStreamToolCallsWithStableIDs(turn.ToolCalls, s.streamToolCallIDs, s.toolsRaw),
+		})
 		s.toolCallsEmitted = true
 		s.toolCallsDoneEmitted = true
 	} else if s.bufferToolContent {
+		batch := chatDeltaBatch{runtime: s}
 		for _, evt := range toolstream.Flush(&s.toolSieve, s.toolNames) {
 			if len(evt.ToolCalls) > 0 {
-				finishReason = "tool_calls"
+				batch.flush()
 				s.toolCallsEmitted = true
 				s.toolCallsDoneEmitted = true
-				tcDelta := map[string]any{
+				s.sendDelta(map[string]any{
 					"tool_calls": formatFinalStreamToolCallsWithStableIDs(evt.ToolCalls, s.streamToolCallIDs, s.toolsRaw),
-				}
-				if !s.firstChunkSent {
-					tcDelta["role"] = "assistant"
-					s.firstChunkSent = true
-				}
-				s.sendChunk(openaifmt.BuildChatStreamChunk(
-					s.completionID,
-					s.created,
-					s.model,
-					[]map[string]any{openaifmt.BuildChatStreamDeltaChoice(0, tcDelta)},
-					nil,
-				))
+				})
 				s.resetStreamToolCallState()
 			}
 			if evt.Content == "" {
 				continue
 			}
 			cleaned := cleanVisibleOutput(evt.Content, s.stripReferenceMarkers)
-			if cleaned == "" {
+			if cleaned == "" || (s.searchEnabled && sse.IsCitation(cleaned)) {
 				continue
 			}
-			delta := map[string]any{
-				"content": cleaned,
-			}
-			if !s.firstChunkSent {
-				delta["role"] = "assistant"
-				s.firstChunkSent = true
-			}
-			s.sendChunk(openaifmt.BuildChatStreamChunk(
-				s.completionID,
-				s.created,
-				s.model,
-				[]map[string]any{openaifmt.BuildChatStreamDeltaChoice(0, delta)},
-				nil,
-			))
+			batch.append("content", cleaned)
 		}
+		batch.flush()
 	}
-	if len(detected.Calls) > 0 || s.toolCallsEmitted {
-		finishReason = "tool_calls"
-	}
-	if len(detected.Calls) == 0 && !s.toolCallsEmitted && strings.TrimSpace(finalText) == "" {
-		status, message, code := upstreamEmptyOutputDetail(finishReason == "content_filter", finalText, finalThinking)
+	outcome := assistantturn.FinalizeTurn(turn, assistantturn.FinalizeOptions{
+		AlreadyEmittedToolCalls: s.toolCallsEmitted || s.toolCallsDoneEmitted,
+	})
+	if outcome.ShouldFail {
+		status, message, code := outcome.Error.Status, outcome.Error.Message, outcome.Error.Code
 		if deferEmptyOutput {
 			s.finalErrorStatus = status
 			s.finalErrorMessage = message
@@ -220,14 +291,14 @@ func (s *chatStreamRuntime) finalize(finishReason string, deferEmptyOutput bool)
 		s.sendFailedChunk(status, message, code)
 		return true
 	}
-	usage := openaifmt.BuildChatUsage(s.finalPrompt, finalThinking, finalText)
+	usage := assistantturn.OpenAIChatUsage(turn)
-	s.finalFinishReason = finishReason
+	s.finalFinishReason = outcome.FinishReason
 	s.finalUsage = usage
 	s.sendChunk(openaifmt.BuildChatStreamChunk(
 		s.completionID,
 		s.created,
 		s.model,
-		[]map[string]any{openaifmt.BuildChatStreamFinishChoice(0, finishReason)},
+		[]map[string]any{openaifmt.BuildChatStreamFinishChoice(0, outcome.FinishReason)},
 		usage,
 	))
 	s.sendDone()
@@ -242,7 +313,7 @@ func (s *chatStreamRuntime) onParsed(parsed sse.LineResult) streamengine.ParsedD
 		s.responseMessageID = parsed.ResponseMessageID
 	}
 	if parsed.ContentFilter {
-		if strings.TrimSpace(s.text.String()) == "" {
+		if strings.TrimSpace(s.accumulator.Text.String()) == "" {
 			return streamengine.ParsedDecision{Stop: true, StopReason: streamengine.StopReason("content_filter")}
 		}
 		return streamengine.ParsedDecision{Stop: true, StopReason: streamengine.StopReasonHandlerRequested}
@@ -254,109 +325,65 @@ func (s *chatStreamRuntime) onParsed(parsed sse.LineResult) streamengine.ParsedD
 		return streamengine.ParsedDecision{Stop: true, StopReason: streamengine.StopReasonHandlerRequested}
 	}
-	newChoices := make([]map[string]any, 0, len(parsed.Parts))
-	contentSeen := false
-	for _, p := range parsed.ToolDetectionThinkingParts {
-		trimmed := sse.TrimContinuationOverlap(s.toolDetectionThinking.String(), p.Text)
-		if trimmed != "" {
-			s.toolDetectionThinking.WriteString(trimmed)
-		}
-	}
-	for _, p := range parsed.Parts {
-		cleanedText := cleanVisibleOutput(p.Text, s.stripReferenceMarkers)
-		if s.searchEnabled && sse.IsCitation(cleanedText) {
-			continue
-		}
-		if cleanedText == "" {
-			continue
-		}
-		contentSeen = true
-		delta := map[string]any{}
-		if !s.firstChunkSent {
-			delta["role"] = "assistant"
-			s.firstChunkSent = true
-		}
+	batch := chatDeltaBatch{runtime: s}
+	accumulated := s.accumulator.Apply(parsed)
+	for _, p := range accumulated.Parts {
 		if p.Type == "thinking" {
-			if s.thinkingEnabled {
-				trimmed := sse.TrimContinuationOverlap(s.thinking.String(), cleanedText)
-				if trimmed == "" {
-					continue
-				}
-				s.thinking.WriteString(trimmed)
-				delta["reasoning_content"] = trimmed
-			}
-		} else {
-			trimmed := sse.TrimContinuationOverlap(s.text.String(), cleanedText)
-			if trimmed == "" {
-				continue
-			}
-			s.text.WriteString(trimmed)
-			if !s.bufferToolContent {
-				delta["content"] = trimmed
-			} else {
-				events := toolstream.ProcessChunk(&s.toolSieve, trimmed, s.toolNames)
-				for _, evt := range events {
-					if len(evt.ToolCallDeltas) > 0 {
-						if !s.emitEarlyToolDeltas {
-							continue
-						}
-						filtered := filterIncrementalToolCallDeltasByAllowed(evt.ToolCallDeltas, s.streamToolNames)
-						if len(filtered) == 0 {
-							continue
-						}
-						formatted := formatIncrementalStreamToolCallDeltas(filtered, s.streamToolCallIDs)
-						if len(formatted) == 0 {
-							continue
-						}
-						tcDelta := map[string]any{
-							"tool_calls": formatted,
-						}
-						s.toolCallsEmitted = true
-						if !s.firstChunkSent {
-							tcDelta["role"] = "assistant"
-							s.firstChunkSent = true
-						}
-						newChoices = append(newChoices, openaifmt.BuildChatStreamDeltaChoice(0, tcDelta))
-						continue
-					}
-					if len(evt.ToolCalls) > 0 {
-						s.toolCallsEmitted = true
-						s.toolCallsDoneEmitted = true
-						tcDelta := map[string]any{
-							"tool_calls": formatFinalStreamToolCallsWithStableIDs(evt.ToolCalls, s.streamToolCallIDs, s.toolsRaw),
-						}
-						if !s.firstChunkSent {
-							tcDelta["role"] = "assistant"
-							s.firstChunkSent = true
-						}
-						newChoices = append(newChoices, openaifmt.BuildChatStreamDeltaChoice(0, tcDelta))
-						s.resetStreamToolCallState()
-						continue
-					}
-					if evt.Content != "" {
-						cleaned := cleanVisibleOutput(evt.Content, s.stripReferenceMarkers)
-						if cleaned == "" {
-							continue
-						}
-						contentDelta := map[string]any{
-							"content": cleaned,
-						}
-						if !s.firstChunkSent {
-							contentDelta["role"] = "assistant"
-							s.firstChunkSent = true
-						}
-						newChoices = append(newChoices, openaifmt.BuildChatStreamDeltaChoice(0, contentDelta))
-					}
-				}
-			}
-		}
-		if len(delta) > 0 {
-			newChoices = append(newChoices, openaifmt.BuildChatStreamDeltaChoice(0, delta))
-		}
+			batch.append("reasoning_content", p.VisibleText)
+			continue
+		}
+		if p.RawText == "" {
+			continue
+		}
+		if p.CitationOnly {
+			continue
+		}
+		if !s.bufferToolContent {
+			batch.append("content", p.VisibleText)
+		} else {
+			events := toolstream.ProcessChunk(&s.toolSieve, p.RawText, s.toolNames)
+			for _, evt := range events {
+				if len(evt.ToolCallDeltas) > 0 {
+					if !s.emitEarlyToolDeltas {
+						continue
+					}
+					filtered := filterIncrementalToolCallDeltasByAllowed(evt.ToolCallDeltas, s.streamToolNames)
+					if len(filtered) == 0 {
+						continue
+					}
+					formatted := formatIncrementalStreamToolCallDeltas(filtered, s.streamToolCallIDs)
+					if len(formatted) == 0 {
+						continue
+					}
+					batch.flush()
+					tcDelta := map[string]any{
+						"tool_calls": formatted,
+					}
+					s.toolCallsEmitted = true
+					s.sendDelta(tcDelta)
+					continue
+				}
+				if len(evt.ToolCalls) > 0 {
+					batch.flush()
+					s.toolCallsEmitted = true
+					s.toolCallsDoneEmitted = true
+					tcDelta := map[string]any{
+						"tool_calls": formatFinalStreamToolCallsWithStableIDs(evt.ToolCalls, s.streamToolCallIDs, s.toolsRaw),
+					}
+					s.sendDelta(tcDelta)
+					s.resetStreamToolCallState()
+					continue
+				}
+				if evt.Content != "" {
+					cleaned := cleanVisibleOutput(evt.Content, s.stripReferenceMarkers)
+					if cleaned == "" || (s.searchEnabled && sse.IsCitation(cleaned)) {
+						continue
+					}
+					batch.append("content", cleaned)
+				}
+			}
+		}
 	}
-
-	if len(newChoices) > 0 {
-		s.sendChunk(openaifmt.BuildChatStreamChunk(s.completionID, s.created, s.model, newChoices, nil))
-	}
-	return streamengine.ParsedDecision{ContentSeen: contentSeen}
+	batch.flush()
+	return streamengine.ParsedDecision{ContentSeen: accumulated.ContentSeen}
 }
--- a/internal/httpapi/openai/chat/chat_stream_runtime_test.go
+++ b/internal/httpapi/openai/chat/chat_stream_runtime_test.go
@@ -0,0 +1,87 @@
 package chat
 import (
 	"net/http"
 	"net/http/httptest"
 	"strings"
 	"testing"
 	"time"
 	"ds2api/internal/promptcompat"
 )
 func TestChatStreamKeepAliveEmitsEmptyChoiceDataFrame(t *testing.T) {
 	rec := httptest.NewRecorder()
 	runtime := newChatStreamRuntime(
 		rec,
 		http.NewResponseController(rec),
 		true,
 		"chatcmpl-test",
 		time.Now().Unix(),
 		"deepseek-v4-flash",
 		"prompt",
 		false,
 		false,
 		true,
 		nil,
 		nil,
 		promptcompat.DefaultToolChoicePolicy(),
 		false,
 		false,
 	)
 	runtime.sendKeepAlive()
 	body := rec.Body.String()
 	if !strings.Contains(body, ": keep-alive\n\n") {
 		t.Fatalf("expected keep-alive comment, got %q", body)
 	}
 	frames, done := parseSSEDataFrames(t, body)
 	if done {
 		t.Fatalf("keep-alive must not emit [DONE], body=%q", body)
 	}
 	if len(frames) != 1 {
 		t.Fatalf("expected one data frame, got %d body=%q", len(frames), body)
 	}
 	if got := asString(frames[0]["id"]); got != "chatcmpl-test" {
 		t.Fatalf("expected completion id to be preserved, got %q", got)
 	}
 	if got := asString(frames[0]["object"]); got != "chat.completion.chunk" {
 		t.Fatalf("expected chat chunk object, got %q", got)
 	}
 	choices, _ := frames[0]["choices"].([]any)
 	if len(choices) != 0 {
 		t.Fatalf("expected empty choices heartbeat, got %#v", choices)
 	}
 }
 func TestChatStreamFinalizeEnforcesRequiredToolChoice(t *testing.T) {
 	rec := httptest.NewRecorder()
 	runtime := newChatStreamRuntime(
 		rec,
 		http.NewResponseController(rec),
 		true,
 		"chatcmpl-test",
 		time.Now().Unix(),
 		"deepseek-v4-flash",
 		"prompt",
 		false,
 		false,
 		true,
 		[]string{"Write"},
 		nil,
 		promptcompat.ToolChoicePolicy{Mode: promptcompat.ToolChoiceRequired},
 		true,
 		false,
 	)
 	if !runtime.finalize("stop", false) {
 		t.Fatalf("expected terminal error to be written")
 	}
 	if runtime.finalErrorCode != "tool_choice_violation" {
 		t.Fatalf("expected tool_choice_violation, got %q body=%s", runtime.finalErrorCode, rec.Body.String())
 	}
 	if !strings.Contains(rec.Body.String(), "tool_choice requires") {
 		t.Fatalf("expected tool choice error in stream body, got %s", rec.Body.String())
 	}
 }
--- a/internal/httpapi/openai/chat/empty_retry_runtime.go
+++ b/internal/httpapi/openai/chat/empty_retry_runtime.go
@@ -7,15 +7,19 @@ import (
 	"strings"
 	"time"
+	"ds2api/internal/assistantturn"
 	"ds2api/internal/auth"
 	"ds2api/internal/config"
 	dsprotocol "ds2api/internal/deepseek/protocol"
 	openaifmt "ds2api/internal/format/openai"
 	"ds2api/internal/promptcompat"
 	"ds2api/internal/sse"
 	streamengine "ds2api/internal/stream"
 )
 type chatNonStreamResult struct {
+	rawThinking           string
+	rawText               string
 	thinking              string
 	toolDetectionThinking string
 	text                  string
@@ -24,13 +28,23 @@ type chatNonStreamResult struct {
 	body                  map[string]any
 	finishReason          string
 	responseMessageID     int
+	outputError           *assistantturn.OutputError
 }
-func (h *Handler) handleNonStreamWithRetry(w http.ResponseWriter, ctx context.Context, a *auth.RequestAuth, resp *http.Response, payload map[string]any, pow, completionID, model, finalPrompt string, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, historySession *chatHistorySession) {
+func (r chatNonStreamResult) historyText() string {
+	return historyTextForArchive(r.rawText, r.text)
+}
+func (r chatNonStreamResult) historyThinking() string {
+	return historyThinkingForArchive(r.rawThinking, r.toolDetectionThinking, r.thinking)
+}
+func (h *Handler) handleNonStreamWithRetry(w http.ResponseWriter, ctx context.Context, a *auth.RequestAuth, resp *http.Response, payload map[string]any, pow, completionID, model, finalPrompt string, refFileTokens int, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, historySession *chatHistorySession) {
 	attempts := 0
 	currentResp := resp
 	usagePrompt := finalPrompt
 	accumulatedThinking := ""
+	accumulatedRawThinking := ""
 	accumulatedToolDetectionThinking := ""
 	for {
 		result, ok := h.collectChatNonStreamAttempt(w, currentResp, completionID, model, usagePrompt, thinkingEnabled, searchEnabled, toolNames, toolsRaw)
@@ -38,15 +52,18 @@ func (h *Handler) handleNonStreamWithRetry(w http.ResponseWriter, ctx context.Co
 			return
 		}
 		accumulatedThinking += sse.TrimContinuationOverlap(accumulatedThinking, result.thinking)
+		accumulatedRawThinking += sse.TrimContinuationOverlap(accumulatedRawThinking, result.rawThinking)
 		accumulatedToolDetectionThinking += sse.TrimContinuationOverlap(accumulatedToolDetectionThinking, result.toolDetectionThinking)
 		result.thinking = accumulatedThinking
+		result.rawThinking = accumulatedRawThinking
 		result.toolDetectionThinking = accumulatedToolDetectionThinking
-		detected := detectAssistantToolCalls(result.text, result.thinking, result.toolDetectionThinking, toolNames)
+		detected := detectAssistantToolCalls(result.rawText, result.text, result.rawThinking, result.toolDetectionThinking, toolNames)
 		result.detectedCalls = len(detected.Calls)
 		result.body = openaifmt.BuildChatCompletionWithToolCalls(completionID, model, usagePrompt, result.thinking, result.text, detected.Calls, toolsRaw)
+		addRefFileTokensToUsage(result.body, refFileTokens)
 		result.finishReason = chatFinishReason(result.body)
 		if !shouldRetryChatNonStream(result, attempts) {
-			h.finishChatNonStreamResult(w, result, attempts, usagePrompt, historySession)
+			h.finishChatNonStreamResult(w, result, attempts, usagePrompt, refFileTokens, historySession)
 			return
 		}
@@ -61,13 +78,13 @@ func (h *Handler) handleNonStreamWithRetry(w http.ResponseWriter, ctx context.Co
 		nextResp, err := h.DS.CallCompletion(ctx, a, retryPayload, retryPow, 3)
 		if err != nil {
 			if historySession != nil {
-				historySession.error(http.StatusInternalServerError, "Failed to get completion.", "error", result.thinking, result.text)
+				historySession.error(http.StatusInternalServerError, "Failed to get completion.", "error", result.historyThinking(), result.historyText())
 			}
 			writeOpenAIError(w, http.StatusInternalServerError, "Failed to get completion.")
 			config.Logger.Warn("[openai_empty_retry] retry request failed", "surface", "chat.completions", "stream", false, "retry_attempt", attempts, "error", err)
 			return
 		}
-		usagePrompt = usagePromptWithEmptyOutputRetry(finalPrompt, attempts)
+		usagePrompt = usagePromptWithEmptyOutputRetry(usagePrompt, attempts)
 		currentResp = nextResp
 	}
 }
@@ -80,39 +97,44 @@ func (h *Handler) collectChatNonStreamAttempt(w http.ResponseWriter, resp *http.
 		return chatNonStreamResult{}, false
 	}
 	result := sse.CollectStream(resp, thinkingEnabled, true)
-	stripReferenceMarkers := h.compatStripReferenceMarkers()
+	turn := assistantturn.BuildTurnFromCollected(result, assistantturn.BuildOptions{
-	finalThinking := cleanVisibleOutput(result.Thinking, stripReferenceMarkers)
+		Model:         model,
-	finalToolDetectionThinking := cleanVisibleOutput(result.ToolDetectionThinking, stripReferenceMarkers)
+		Prompt:        usagePrompt,
-	finalText := cleanVisibleOutput(result.Text, stripReferenceMarkers)
+		SearchEnabled: searchEnabled,
-	if searchEnabled {
+		ToolNames:     toolNames,
-		finalText = replaceCitationMarkersWithLinks(finalText, result.CitationLinks)
+		ToolsRaw:      toolsRaw,
-	}
+	})
-	detected := detectAssistantToolCalls(finalText, finalThinking, finalToolDetectionThinking, toolNames)
+	respBody := openaifmt.BuildChatCompletionWithToolCalls(completionID, model, usagePrompt, turn.Thinking, turn.Text, turn.ToolCalls, toolsRaw)
-	respBody := openaifmt.BuildChatCompletionWithToolCalls(completionID, model, usagePrompt, finalThinking, finalText, detected.Calls, toolsRaw)
 	return chatNonStreamResult{
-		thinking:              finalThinking,
+		rawThinking:           result.Thinking,
-		toolDetectionThinking: finalToolDetectionThinking,
+		rawText:               result.Text,
-		text:                  finalText,
+		thinking:              turn.Thinking,
+		toolDetectionThinking: result.ToolDetectionThinking,
+		text:                  turn.Text,
 		contentFilter:         result.ContentFilter,
-		detectedCalls:         len(detected.Calls),
+		detectedCalls:         len(turn.ToolCalls),
 		body:                  respBody,
 		finishReason:          chatFinishReason(respBody),
 		responseMessageID:     result.ResponseMessageID,
+		outputError:           turn.Error,
 	}, true
 }
-func (h *Handler) finishChatNonStreamResult(w http.ResponseWriter, result chatNonStreamResult, attempts int, usagePrompt string, historySession *chatHistorySession) {
+func (h *Handler) finishChatNonStreamResult(w http.ResponseWriter, result chatNonStreamResult, attempts int, usagePrompt string, refFileTokens int, historySession *chatHistorySession) {
-	if result.detectedCalls == 0 && shouldWriteUpstreamEmptyOutputError(result.text) {
+	if result.detectedCalls == 0 && strings.TrimSpace(result.text) == "" {
 		status, message, code := upstreamEmptyOutputDetail(result.contentFilter, result.text, result.thinking)
-		if historySession != nil {
+		if result.outputError != nil {
-			historySession.error(status, message, code, result.thinking, result.text)
+			status, message, code = result.outputError.Status, result.outputError.Message, result.outputError.Code
 		}
-		writeUpstreamEmptyOutputError(w, result.text, result.thinking, result.contentFilter)
+		if historySession != nil {
+			historySession.error(status, message, code, result.historyThinking(), result.historyText())
+		}
+		writeOpenAIErrorWithCode(w, status, message, code)
 		config.Logger.Info("[openai_empty_retry] terminal empty output", "surface", "chat.completions", "stream", false, "retry_attempts", attempts, "success_source", "none", "content_filter", result.contentFilter)
 		return
 	}
 	if historySession != nil {
-		historySession.success(http.StatusOK, result.thinking, result.text, result.finishReason, openaifmt.BuildChatUsage(usagePrompt, result.thinking, result.text))
+		historySession.success(http.StatusOK, result.historyThinking(), result.historyText(), result.finishReason, openaifmt.BuildChatUsageForModel("", usagePrompt, result.thinking, result.text, refFileTokens))
 	}
 	writeJSON(w, http.StatusOK, result.body)
 	source := "first_attempt"
@@ -139,8 +161,8 @@ func shouldRetryChatNonStream(result chatNonStreamResult, attempts int) bool {
 		strings.TrimSpace(result.text) == ""
 }
-func (h *Handler) handleStreamWithRetry(w http.ResponseWriter, r *http.Request, a *auth.RequestAuth, resp *http.Response, payload map[string]any, pow, completionID, model, finalPrompt string, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, historySession *chatHistorySession) {
+func (h *Handler) handleStreamWithRetry(w http.ResponseWriter, r *http.Request, a *auth.RequestAuth, resp *http.Response, payload map[string]any, pow, completionID, model, finalPrompt string, refFileTokens int, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, toolChoice promptcompat.ToolChoicePolicy, historySession *chatHistorySession) {
-	streamRuntime, initialType, ok := h.prepareChatStreamRuntime(w, resp, completionID, model, finalPrompt, thinkingEnabled, searchEnabled, toolNames, toolsRaw, historySession)
+	streamRuntime, initialType, ok := h.prepareChatStreamRuntime(w, resp, completionID, model, finalPrompt, refFileTokens, thinkingEnabled, searchEnabled, toolNames, toolsRaw, toolChoice, historySession)
 	if !ok {
 		return
 	}
@@ -182,7 +204,7 @@ func (h *Handler) handleStreamWithRetry(w http.ResponseWriter, r *http.Request,
 	}
 }
-func (h *Handler) prepareChatStreamRuntime(w http.ResponseWriter, resp *http.Response, completionID, model, finalPrompt string, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, historySession *chatHistorySession) (*chatStreamRuntime, string, bool) {
+func (h *Handler) prepareChatStreamRuntime(w http.ResponseWriter, resp *http.Response, completionID, model, finalPrompt string, refFileTokens int, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, toolChoice promptcompat.ToolChoicePolicy, historySession *chatHistorySession) (*chatStreamRuntime, string, bool) {
 	if resp.StatusCode != http.StatusOK {
 		defer func() { _ = resp.Body.Close() }()
 		body, _ := io.ReadAll(resp.Body)
@@ -207,9 +229,11 @@ func (h *Handler) prepareChatStreamRuntime(w http.ResponseWriter, resp *http.Res
 	}
 	streamRuntime := newChatStreamRuntime(
 		w, rc, canFlush, completionID, time.Now().Unix(), model, finalPrompt,
-		thinkingEnabled, searchEnabled, h.compatStripReferenceMarkers(), toolNames, toolsRaw,
+		thinkingEnabled, searchEnabled, stripReferenceMarkersEnabled(), toolNames, toolsRaw,
+		toolChoice,
 		len(toolNames) > 0, h.toolcallFeatureMatchEnabled() && h.toolcallEarlyEmitHighConfidence(),
 	)
+	streamRuntime.refFileTokens = refFileTokens
 	return streamRuntime, initialType, true
 }
@@ -229,7 +253,7 @@ func (h *Handler) consumeChatStreamAttempt(r *http.Request, resp *http.Response,
 		OnParsed: func(parsed sse.LineResult) streamengine.ParsedDecision {
 			decision := streamRuntime.onParsed(parsed)
 			if historySession != nil {
-				historySession.progress(streamRuntime.thinking.String(), streamRuntime.text.String())
+				historySession.progress(streamRuntime.historyThinking(), streamRuntime.historyText())
 			}
 			return decision
 		},
@@ -239,11 +263,15 @@ func (h *Handler) consumeChatStreamAttempt(r *http.Request, resp *http.Response,
 			}
 		},
 		OnContextDone: func() {
+			streamRuntime.markContextCancelled()
 			if historySession != nil {
-				historySession.stopped(streamRuntime.thinking.String(), streamRuntime.text.String(), string(streamengine.StopReasonContextCancelled))
+				historySession.stopped(streamRuntime.historyThinking(), streamRuntime.historyText(), string(streamengine.StopReasonContextCancelled))
 			}
 		},
 	})
+	if streamRuntime.finalErrorCode == string(streamengine.StopReasonContextCancelled) {
+		return true, false
+	}
 	terminalWritten := streamRuntime.finalize(finalReason, allowDeferEmpty && finalReason != "content_filter")
 	if terminalWritten {
 		recordChatStreamHistory(streamRuntime, historySession)
@@ -257,16 +285,16 @@ func recordChatStreamHistory(streamRuntime *chatStreamRuntime, historySession *c
 		return
 	}
 	if streamRuntime.finalErrorMessage != "" {
-		historySession.error(streamRuntime.finalErrorStatus, streamRuntime.finalErrorMessage, streamRuntime.finalErrorCode, streamRuntime.thinking.String(), streamRuntime.text.String())
+		historySession.error(streamRuntime.finalErrorStatus, streamRuntime.finalErrorMessage, streamRuntime.finalErrorCode, streamRuntime.historyThinking(), streamRuntime.historyText())
 		return
 	}
-	historySession.success(http.StatusOK, streamRuntime.finalThinking, streamRuntime.finalText, streamRuntime.finalFinishReason, streamRuntime.finalUsage)
+	historySession.success(http.StatusOK, streamRuntime.historyThinking(), streamRuntime.historyText(), streamRuntime.finalFinishReason, streamRuntime.finalUsage)
 }
 func failChatStreamRetry(streamRuntime *chatStreamRuntime, historySession *chatHistorySession, status int, message, code string) {
 	streamRuntime.sendFailedChunk(status, message, code)
 	if historySession != nil {
-		historySession.error(status, message, code, streamRuntime.thinking.String(), streamRuntime.text.String())
+		historySession.error(status, message, code, streamRuntime.historyThinking(), streamRuntime.historyText())
 	}
 }
@@ -275,6 +303,10 @@ func logChatStreamTerminal(streamRuntime *chatStreamRuntime, attempts int) {
 	if attempts > 0 {
 		source = "synthetic_retry"
 	}
+	if streamRuntime.finalErrorCode == string(streamengine.StopReasonContextCancelled) {
+		config.Logger.Info("[openai_empty_retry] terminal cancelled", "surface", "chat.completions", "stream", true, "retry_attempts", attempts, "error_code", streamRuntime.finalErrorCode)
+		return
+	}
 	if streamRuntime.finalErrorMessage != "" {
 		config.Logger.Info("[openai_empty_retry] terminal empty output", "surface", "chat.completions", "stream", true, "retry_attempts", attempts, "success_source", "none", "error_code", streamRuntime.finalErrorCode)
 		return
--- a/internal/httpapi/openai/chat/empty_retry_runtime_test.go
+++ b/internal/httpapi/openai/chat/empty_retry_runtime_test.go
@@ -0,0 +1,87 @@
 package chat
 import (
 	"context"
 	"net/http"
 	"net/http/httptest"
 	"testing"
 	"time"
 	"ds2api/internal/chathistory"
 	"ds2api/internal/promptcompat"
 	"ds2api/internal/stream"
 )
 func TestConsumeChatStreamAttemptMarksContextCancelledState(t *testing.T) {
 	historyStore := newTestChatHistoryStore(t)
 	entry, err := historyStore.Start(chathistory.StartParams{
 		CallerID:  "caller:test",
 		Model:     "deepseek-v4-flash",
 		Stream:    true,
 		UserInput: "hello",
 	})
 	if err != nil {
 		t.Fatalf("start history failed: %v", err)
 	}
 	session := &chatHistorySession{
 		store:       historyStore,
 		entryID:     entry.ID,
 		startedAt:   time.Now(),
 		lastPersist: time.Now(),
 		finalPrompt: "prompt",
 	}
 	ctx, cancel := context.WithCancel(context.Background())
 	cancel()
 	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil).WithContext(ctx)
 	rec := httptest.NewRecorder()
 	streamRuntime := newChatStreamRuntime(
 		rec,
 		http.NewResponseController(rec),
 		true,
 		"cid-cancelled",
 		time.Now().Unix(),
 		"deepseek-v4-flash",
 		"prompt",
 		false,
 		false,
 		true,
 		nil,
 		nil,
 		promptcompat.DefaultToolChoicePolicy(),
 		false,
 		false,
 	)
 	resp := makeOpenAISSEHTTPResponse(
 		`data: {"p":"response/content","v":"hello"}`,
 		`data: [DONE]`,
 	)
 	h := &Handler{}
 	terminalWritten, retryable := h.consumeChatStreamAttempt(req, resp, streamRuntime, "text", false, session, true)
 	if !terminalWritten || retryable {
 		t.Fatalf("expected cancelled attempt to terminate without retry, got terminalWritten=%v retryable=%v", terminalWritten, retryable)
 	}
 	if got, want := streamRuntime.finalErrorCode, string(stream.StopReasonContextCancelled); got != want {
 		t.Fatalf("expected cancelled final error code %q, got %q", want, got)
 	}
 	if streamRuntime.finalErrorMessage == "" {
 		t.Fatalf("expected cancelled final error message to be preserved")
 	}
 	snapshot, err := historyStore.Snapshot()
 	if err != nil {
 		t.Fatalf("snapshot failed: %v", err)
 	}
 	if len(snapshot.Items) != 1 {
 		t.Fatalf("expected one history item, got %d", len(snapshot.Items))
 	}
 	full, err := historyStore.Get(snapshot.Items[0].ID)
 	if err != nil {
 		t.Fatalf("get detail failed: %v", err)
 	}
 	if full.Status != "stopped" {
 		t.Fatalf("expected stopped status, got %#v", full)
 	}
 }
--- a/internal/httpapi/openai/chat/handler.go
+++ b/internal/httpapi/openai/chat/handler.go
@@ -12,6 +12,7 @@ import (
 	"ds2api/internal/httpapi/openai/history"
 	"ds2api/internal/httpapi/openai/shared"
 	"ds2api/internal/promptcompat"
+	"ds2api/internal/textclean"
 	"ds2api/internal/toolcall"
 	"ds2api/internal/toolstream"
 )
@@ -35,11 +36,8 @@ type streamLease struct {
 	ExpiresAt time.Time
 }
-func (h *Handler) compatStripReferenceMarkers() bool {
+func stripReferenceMarkersEnabled() bool {
-	if h == nil {
+	return textclean.StripReferenceMarkersEnabled()
-		return true
-	}
-	return shared.CompatStripReferenceMarkers(h.Store)
 }
 func (h *Handler) applyCurrentInputFile(ctx context.Context, a *auth.RequestAuth, stdReq promptcompat.StandardRequest) (promptcompat.StandardRequest, error) {
@@ -80,6 +78,10 @@ func writeOpenAIError(w http.ResponseWriter, status int, message string) {
 	shared.WriteOpenAIError(w, status, message)
 }
+func writeOpenAIErrorWithCode(w http.ResponseWriter, status int, message, code string) {
+	shared.WriteOpenAIErrorWithCode(w, status, message, code)
+}
 func openAIErrorType(status int) string {
 	return shared.OpenAIErrorType(status)
 }
@@ -104,22 +106,10 @@ func cleanVisibleOutput(text string, stripReferenceMarkers bool) string {
 	return shared.CleanVisibleOutput(text, stripReferenceMarkers)
 }
-func replaceCitationMarkersWithLinks(text string, links map[int]string) string {
-	return shared.ReplaceCitationMarkersWithLinks(text, links)
-}
-func shouldWriteUpstreamEmptyOutputError(text string) bool {
-	return shared.ShouldWriteUpstreamEmptyOutputError(text)
-}
 func upstreamEmptyOutputDetail(contentFilter bool, text, thinking string) (int, string, string) {
 	return shared.UpstreamEmptyOutputDetail(contentFilter, text, thinking)
 }
-func writeUpstreamEmptyOutputError(w http.ResponseWriter, text, thinking string, contentFilter bool) bool {
-	return shared.WriteUpstreamEmptyOutputError(w, text, thinking, contentFilter)
-}
 func emptyOutputRetryEnabled() bool {
 	return shared.EmptyOutputRetryEnabled()
 }
@@ -148,6 +138,6 @@ func formatFinalStreamToolCallsWithStableIDs(calls []toolcall.ParsedToolCall, id
 	return shared.FormatFinalStreamToolCallsWithStableIDs(calls, ids, toolsRaw)
 }
-func detectAssistantToolCalls(text, exposedThinking, detectionThinking string, toolNames []string) toolcall.ToolCallParseResult {
+func detectAssistantToolCalls(rawText, visibleText, exposedThinking, detectionThinking string, toolNames []string) toolcall.ToolCallParseResult {
-	return shared.DetectAssistantToolCalls(text, exposedThinking, detectionThinking, toolNames)
+	return shared.DetectAssistantToolCalls(rawText, visibleText, exposedThinking, detectionThinking, toolNames)
 }
--- a/internal/httpapi/openai/chat/handler_chat.go
+++ b/internal/httpapi/openai/chat/handler_chat.go
@@ -8,7 +8,9 @@ import (
 	"strings"
 	"time"
+	"ds2api/internal/assistantturn"
 	"ds2api/internal/auth"
+	"ds2api/internal/completionruntime"
 	"ds2api/internal/config"
 	dsprotocol "ds2api/internal/deepseek/protocol"
 	openaifmt "ds2api/internal/format/openai"
@@ -76,43 +78,43 @@ func (h *Handler) ChatCompletions(w http.ResponseWriter, r *http.Request) {
 	}
 	historySession := startChatHistory(h.ChatHistory, r, a, stdReq)
-	sessionID, err = h.DS.CreateSession(r.Context(), a, 3)
+	if !stdReq.Stream {
-	if err != nil {
+		result, outErr := completionruntime.ExecuteNonStreamWithRetry(r.Context(), h.DS, a, stdReq, completionruntime.Options{
-		if a.UseConfigToken {
+			RetryEnabled:     true,
+			CurrentInputFile: h.Store,
+		})
+		sessionID = result.SessionID
+		if outErr != nil {
 			if historySession != nil {
-				historySession.error(http.StatusUnauthorized, "Account token is invalid. Please re-login the account in admin.", "error", "", "")
+				historySession.error(outErr.Status, outErr.Message, outErr.Code, historyThinkingForArchive(result.Turn.RawThinking, result.Turn.DetectionThinking, result.Turn.Thinking), historyTextForArchive(result.Turn.RawText, result.Turn.Text))
 			}
-			writeOpenAIError(w, http.StatusUnauthorized, "Account token is invalid. Please re-login the account in admin.")
+			writeOpenAIErrorWithCode(w, outErr.Status, outErr.Message, outErr.Code)
-		} else {
+			return
-			if historySession != nil {
-				historySession.error(http.StatusUnauthorized, "Invalid token. If this should be a DS2API key, add it to config.keys first.", "error", "", "")
-			}
-			writeOpenAIError(w, http.StatusUnauthorized, "Invalid token. If this should be a DS2API key, add it to config.keys first.")
 		}
-		return
+		respBody := openaifmt.BuildChatCompletionWithToolCalls(result.SessionID, stdReq.ResponseModel, result.Turn.Prompt, result.Turn.Thinking, result.Turn.Text, result.Turn.ToolCalls, stdReq.ToolsRaw)
-	}
+		respBody["usage"] = assistantturn.OpenAIChatUsage(result.Turn)
-	pow, err := h.DS.GetPow(r.Context(), a, 3)
+		finishReason := assistantturn.FinalizeTurn(result.Turn, assistantturn.FinalizeOptions{}).FinishReason
-	if err != nil {
 		if historySession != nil {
-			historySession.error(http.StatusUnauthorized, "Failed to get PoW (invalid token or unknown error).", "error", "", "")
+			historySession.success(http.StatusOK, historyThinkingForArchive(result.Turn.RawThinking, result.Turn.DetectionThinking, result.Turn.Thinking), historyTextForArchive(result.Turn.RawText, result.Turn.Text), finishReason, assistantturn.OpenAIChatUsage(result.Turn))
 		}
-		writeOpenAIError(w, http.StatusUnauthorized, "Failed to get PoW (invalid token or unknown error).")
+		writeJSON(w, http.StatusOK, respBody)
 		return
 	}
-	payload := stdReq.CompletionPayload(sessionID)
+
-	resp, err := h.DS.CallCompletion(r.Context(), a, payload, pow, 3)
+	start, outErr := completionruntime.StartCompletion(r.Context(), h.DS, a, stdReq, completionruntime.Options{
-	if err != nil {
+		CurrentInputFile: h.Store,
+	})
+	sessionID = start.SessionID
+	if outErr != nil {
 		if historySession != nil {
-			historySession.error(http.StatusInternalServerError, "Failed to get completion.", "error", "", "")
+			historySession.error(outErr.Status, outErr.Message, outErr.Code, "", "")
 		}
-		writeOpenAIError(w, http.StatusInternalServerError, "Failed to get completion.")
+		writeOpenAIErrorWithCode(w, outErr.Status, outErr.Message, outErr.Code)
 		return
 	}
-	if stdReq.Stream {
+	streamReq := start.Request
-		h.handleStreamWithRetry(w, r, a, resp, payload, pow, sessionID, stdReq.ResponseModel, stdReq.FinalPrompt, stdReq.Thinking, stdReq.Search, stdReq.ToolNames, stdReq.ToolsRaw, historySession)
+	refFileTokens := streamReq.RefFileTokens
-		return
+	h.handleStreamWithRetry(w, r, a, start.Response, start.Payload, start.Pow, sessionID, streamReq.ResponseModel, streamReq.PromptTokenText, refFileTokens, streamReq.Thinking, streamReq.Search, streamReq.ToolNames, streamReq.ToolsRaw, streamReq.ToolChoice, historySession)
-	}
-	h.handleNonStreamWithRetry(w, r.Context(), a, resp, payload, pow, sessionID, stdReq.ResponseModel, stdReq.FinalPrompt, stdReq.Thinking, stdReq.Search, stdReq.ToolNames, stdReq.ToolsRaw, historySession)
 }
 func (h *Handler) autoDeleteRemoteSession(ctx context.Context, a *auth.RequestAuth, sessionID string) {
@@ -148,7 +150,7 @@ func (h *Handler) autoDeleteRemoteSession(ctx context.Context, a *auth.RequestAu
 	}
 }
-func (h *Handler) handleNonStream(w http.ResponseWriter, resp *http.Response, completionID, model, finalPrompt string, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, historySession *chatHistorySession) {
+func (h *Handler) handleNonStream(w http.ResponseWriter, resp *http.Response, completionID, model, finalPrompt string, refFileTokens int, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, historySession *chatHistorySession) {
 	if resp.StatusCode != http.StatusOK {
 		defer func() { _ = resp.Body.Close() }()
 		body, _ := io.ReadAll(resp.Body)
@@ -160,36 +162,33 @@ func (h *Handler) handleNonStream(w http.ResponseWriter, resp *http.Response, co
 	}
 	result := sse.CollectStream(resp, thinkingEnabled, true)
-	stripReferenceMarkers := h.compatStripReferenceMarkers()
+	turn := assistantturn.BuildTurnFromCollected(result, assistantturn.BuildOptions{
-	finalThinking := cleanVisibleOutput(result.Thinking, stripReferenceMarkers)
+		Model:         model,
-	finalToolDetectionThinking := cleanVisibleOutput(result.ToolDetectionThinking, stripReferenceMarkers)
+		Prompt:        finalPrompt,
-	finalText := cleanVisibleOutput(result.Text, stripReferenceMarkers)
+		RefFileTokens: refFileTokens,
-	if searchEnabled {
+		SearchEnabled: searchEnabled,
-		finalText = replaceCitationMarkersWithLinks(finalText, result.CitationLinks)
+		ToolNames:     toolNames,
-	}
+		ToolsRaw:      toolsRaw,
-	detected := detectAssistantToolCalls(finalText, finalThinking, finalToolDetectionThinking, toolNames)
+		ToolChoice:    promptcompat.DefaultToolChoicePolicy(),
-	if shouldWriteUpstreamEmptyOutputError(finalText) && len(detected.Calls) == 0 {
+	})
-		status, message, code := upstreamEmptyOutputDetail(result.ContentFilter, finalText, finalThinking)
+	outcome := assistantturn.FinalizeTurn(turn, assistantturn.FinalizeOptions{})
+	if outcome.ShouldFail {
+		status, message, code := outcome.Error.Status, outcome.Error.Message, outcome.Error.Code
 		if historySession != nil {
-			historySession.error(status, message, code, finalThinking, finalText)
+			historySession.error(status, message, code, historyThinkingForArchive(turn.RawThinking, turn.DetectionThinking, turn.Thinking), historyTextForArchive(turn.RawText, turn.Text))
 		}
-		writeUpstreamEmptyOutputError(w, finalText, finalThinking, result.ContentFilter)
+		writeOpenAIErrorWithCode(w, status, message, code)
 		return
 	}
-	respBody := openaifmt.BuildChatCompletionWithToolCalls(completionID, model, finalPrompt, finalThinking, finalText, detected.Calls, toolsRaw)
+	respBody := openaifmt.BuildChatCompletionWithToolCalls(completionID, model, finalPrompt, turn.Thinking, turn.Text, turn.ToolCalls, toolsRaw)
-	finishReason := "stop"
+	respBody["usage"] = assistantturn.OpenAIChatUsage(turn)
-	if choices, ok := respBody["choices"].([]map[string]any); ok && len(choices) > 0 {
-		if fr, _ := choices[0]["finish_reason"].(string); strings.TrimSpace(fr) != "" {
-			finishReason = fr
-		}
-	}
 	if historySession != nil {
-		historySession.success(http.StatusOK, finalThinking, finalText, finishReason, openaifmt.BuildChatUsage(finalPrompt, finalThinking, finalText))
+		historySession.success(http.StatusOK, historyThinkingForArchive(turn.RawThinking, turn.DetectionThinking, turn.Thinking), historyTextForArchive(turn.RawText, turn.Text), outcome.FinishReason, assistantturn.OpenAIChatUsage(turn))
 	}
 	writeJSON(w, http.StatusOK, respBody)
 }
-func (h *Handler) handleStream(w http.ResponseWriter, r *http.Request, resp *http.Response, completionID, model, finalPrompt string, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, historySession *chatHistorySession) {
+func (h *Handler) handleStream(w http.ResponseWriter, r *http.Request, resp *http.Response, completionID, model, finalPrompt string, refFileTokens int, thinkingEnabled, searchEnabled bool, toolNames []string, toolsRaw any, historySession *chatHistorySession) {
 	defer func() { _ = resp.Body.Close() }()
 	if resp.StatusCode != http.StatusOK {
 		body, _ := io.ReadAll(resp.Body)
@@ -212,7 +211,7 @@ func (h *Handler) handleStream(w http.ResponseWriter, r *http.Request, resp *htt
 	created := time.Now().Unix()
 	bufferToolContent := len(toolNames) > 0
 	emitEarlyToolDeltas := h.toolcallFeatureMatchEnabled() && h.toolcallEarlyEmitHighConfidence()
-	stripReferenceMarkers := h.compatStripReferenceMarkers()
+	stripReferenceMarkers := stripReferenceMarkersEnabled()
 	initialType := "text"
 	if thinkingEnabled {
 		initialType = "thinking"
@@ -231,9 +230,11 @@ func (h *Handler) handleStream(w http.ResponseWriter, r *http.Request, resp *htt
 		stripReferenceMarkers,
 		toolNames,
 		toolsRaw,
+		promptcompat.DefaultToolChoicePolicy(),
 		bufferToolContent,
 		emitEarlyToolDeltas,
 	)
+	streamRuntime.refFileTokens = refFileTokens
 	streamengine.ConsumeSSE(streamengine.ConsumeConfig{
 		Context:             r.Context(),
@@ -250,7 +251,7 @@ func (h *Handler) handleStream(w http.ResponseWriter, r *http.Request, resp *htt
 		OnParsed: func(parsed sse.LineResult) streamengine.ParsedDecision {
 			decision := streamRuntime.onParsed(parsed)
 			if historySession != nil {
-				historySession.progress(streamRuntime.thinking.String(), streamRuntime.text.String())
+				historySession.progress(streamRuntime.historyThinking(), streamRuntime.historyText())
 			}
 			return decision
 		},
@@ -264,14 +265,15 @@ func (h *Handler) handleStream(w http.ResponseWriter, r *http.Request, resp *htt
 				return
 			}
 			if streamRuntime.finalErrorMessage != "" {
-				historySession.error(streamRuntime.finalErrorStatus, streamRuntime.finalErrorMessage, streamRuntime.finalErrorCode, streamRuntime.thinking.String(), streamRuntime.text.String())
+				historySession.error(streamRuntime.finalErrorStatus, streamRuntime.finalErrorMessage, streamRuntime.finalErrorCode, streamRuntime.historyThinking(), streamRuntime.historyText())
 				return
 			}
-			historySession.success(http.StatusOK, streamRuntime.finalThinking, streamRuntime.finalText, streamRuntime.finalFinishReason, streamRuntime.finalUsage)
+			historySession.success(http.StatusOK, streamRuntime.historyThinking(), streamRuntime.historyText(), streamRuntime.finalFinishReason, streamRuntime.finalUsage)
 		},
 		OnContextDone: func() {
+			streamRuntime.markContextCancelled()
 			if historySession != nil {
-				historySession.stopped(streamRuntime.thinking.String(), streamRuntime.text.String(), string(streamengine.StopReasonContextCancelled))
+				historySession.stopped(streamRuntime.historyThinking(), streamRuntime.historyText(), string(streamengine.StopReasonContextCancelled))
 			}
 		},
 	})
--- a/internal/httpapi/openai/chat/handler_chat_auto_delete_test.go
+++ b/internal/httpapi/openai/chat/handler_chat_auto_delete_test.go
@@ -75,7 +75,6 @@ func TestChatCompletionsAutoDeleteModes(t *testing.T) {
 			}
 			h := &Handler{
 				Store: mockOpenAIConfig{
-					wideInput:      true,
 					autoDeleteMode: tc.mode,
 				},
 				Auth: streamStatusAuthStub{},
@@ -123,7 +122,6 @@ func TestAutoDeleteRemoteSessionIgnoresCanceledParentContext(t *testing.T) {
 	ds := &autoDeleteCtxDSStub{}
 	h := &Handler{
 		Store: mockOpenAIConfig{
-			wideInput:      true,
 			autoDeleteMode: "single",
 		},
 		DS: ds,
--- a/internal/httpapi/openai/chat/handler_toolcall_test.go
+++ b/internal/httpapi/openai/chat/handler_toolcall_test.go
@@ -1,6 +1,7 @@
 package chat
 import (
 	"context"
 	"encoding/json"
 	"io"
 	"net/http"
@@ -93,7 +94,7 @@ func TestHandleNonStreamReturns429WhenUpstreamOutputEmpty(t *testing.T) {
 	)
 	rec := httptest.NewRecorder()
-	h.handleNonStream(rec, resp, "cid-empty", "deepseek-v4-flash", "prompt", false, false, nil, nil, nil)
+	h.handleNonStream(rec, resp, "cid-empty", "deepseek-v4-flash", "prompt", 0, false, false, nil, nil, nil)
 	if rec.Code != http.StatusTooManyRequests {
 		t.Fatalf("expected status 429 for empty upstream output, got %d body=%s", rec.Code, rec.Body.String())
 	}
@@ -112,7 +113,7 @@ func TestHandleNonStreamReturnsContentFilterErrorWhenUpstreamFilteredWithoutOutp
 	)
 	rec := httptest.NewRecorder()
-	h.handleNonStream(rec, resp, "cid-empty-filtered", "deepseek-v4-flash", "prompt", false, false, nil, nil, nil)
+	h.handleNonStream(rec, resp, "cid-empty-filtered", "deepseek-v4-flash", "prompt", 0, false, false, nil, nil, nil)
 	if rec.Code != http.StatusBadRequest {
 		t.Fatalf("expected status 400 for filtered upstream output, got %d body=%s", rec.Code, rec.Body.String())
 	}
@@ -131,7 +132,7 @@ func TestHandleNonStreamReturns429WhenUpstreamHasOnlyThinking(t *testing.T) {
 	)
 	rec := httptest.NewRecorder()
-	h.handleNonStream(rec, resp, "cid-thinking-only", "deepseek-v4-pro", "prompt", true, false, nil, nil, nil)
+	h.handleNonStream(rec, resp, "cid-thinking-only", "deepseek-v4-pro", "prompt", 0, true, false, nil, nil, nil)
 	if rec.Code != http.StatusTooManyRequests {
 		t.Fatalf("expected status 429 for thinking-only upstream output, got %d body=%s", rec.Code, rec.Body.String())
 	}
@@ -150,7 +151,7 @@ func TestHandleNonStreamPromotesThinkingToolCallsWhenTextEmpty(t *testing.T) {
 	)
 	rec := httptest.NewRecorder()
-	h.handleNonStream(rec, resp, "cid-thinking-tool", "deepseek-v4-pro", "prompt", true, false, []string{"search"}, nil, nil)
+	h.handleNonStream(rec, resp, "cid-thinking-tool", "deepseek-v4-pro", "prompt", 0, true, false, []string{"search"}, nil, nil)
 	if rec.Code != http.StatusOK {
 		t.Fatalf("expected 200 for thinking tool calls, got %d body=%s", rec.Code, rec.Body.String())
 	}
@@ -181,7 +182,7 @@ func TestHandleNonStreamPromotesHiddenThinkingDSMLToolCallsWhenTextEmpty(t *test
 	)
 	rec := httptest.NewRecorder()
-	h.handleNonStream(rec, resp, "cid-hidden-thinking-tool", "deepseek-v4-pro", "prompt", false, false, []string{"search"}, nil, nil)
+	h.handleNonStream(rec, resp, "cid-hidden-thinking-tool", "deepseek-v4-pro", "prompt", 0, false, false, []string{"search"}, nil, nil)
 	if rec.Code != http.StatusOK {
 		t.Fatalf("expected 200 for hidden thinking tool calls, got %d body=%s", rec.Code, rec.Body.String())
 	}
@@ -211,7 +212,7 @@ func TestHandleStreamToolsPlainTextStreamsBeforeFinish(t *testing.T) {
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil)
-	h.handleStream(rec, req, resp, "cid6", "deepseek-v4-flash", "prompt", false, false, []string{"search"}, nil, nil)
+	h.handleStream(rec, req, resp, "cid6", "deepseek-v4-flash", "prompt", 0, false, false, []string{"search"}, nil, nil)
 	frames, done := parseSSEDataFrames(t, rec.Body.String())
 	if !done {
@@ -239,6 +240,118 @@ func TestHandleStreamToolsPlainTextStreamsBeforeFinish(t *testing.T) {
 	}
 }
 func TestHandleStreamThinkingDisabledDoesNotLeakHiddenFragmentContinuations(t *testing.T) {
 	h := &Handler{}
 	resp := makeSSEHTTPResponse(
 		`data: {"p":"response/fragments","o":"APPEND","v":[{"type":"THINK","content":"我们"}]}`,
 		`data: {"p":"response/fragments/-1/content","v":"被"}`,
 		`data: {"v":"要求"}`,
 		`data: {"p":"response/fragments","o":"APPEND","v":[{"type":"RESPONSE","content":"答"}]}`,
 		`data: {"p":"response/fragments/-1/content","v":"案"}`,
 		`data: [DONE]`,
 	)
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil)
 	h.handleStream(rec, req, resp, "cid-hidden-fragment", "deepseek-v4-flash", "prompt", 0, false, false, nil, nil, nil)
 	frames, done := parseSSEDataFrames(t, rec.Body.String())
 	if !done {
 		t.Fatalf("expected [DONE], body=%s", rec.Body.String())
 	}
 	content := strings.Builder{}
 	for _, frame := range frames {
 		choices, _ := frame["choices"].([]any)
 		for _, item := range choices {
 			choice, _ := item.(map[string]any)
 			delta, _ := choice["delta"].(map[string]any)
 			if c, ok := delta["content"].(string); ok {
 				content.WriteString(c)
 			}
 		}
 	}
 	if got := content.String(); got != "答案" {
 		t.Fatalf("expected only visible response text, got %q body=%s", got, rec.Body.String())
 	}
 }
 func TestHandleStreamEmitsSingleChoiceFramesForMultipleParsedParts(t *testing.T) {
 	h := &Handler{}
 	resp := makeSSEHTTPResponse(
 		`data: {"p":"response/fragments","o":"APPEND","v":[{"type":"THINK","content":"我们"},{"type":"THINK","content":"被"},{"type":"THINK","content":"要求"},{"type":"RESPONSE","content":"答"},{"type":"RESPONSE","content":"案"}]}`,
 		`data: [DONE]`,
 	)
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil)
 	h.handleStream(rec, req, resp, "cid-multi-parts", "deepseek-v4-pro", "prompt", 0, true, false, nil, nil, nil)
 	frames, done := parseSSEDataFrames(t, rec.Body.String())
 	if !done {
 		t.Fatalf("expected [DONE], body=%s", rec.Body.String())
 	}
 	var reasoning, content strings.Builder
 	for _, frame := range frames {
 		choices, _ := frame["choices"].([]any)
 		if len(choices) != 1 {
 			t.Fatalf("expected exactly one choice per stream frame, got %d frame=%#v body=%s", len(choices), frame, rec.Body.String())
 		}
 		choice, _ := choices[0].(map[string]any)
 		delta, _ := choice["delta"].(map[string]any)
 		reasoning.WriteString(asString(delta["reasoning_content"]))
 		content.WriteString(asString(delta["content"]))
 	}
 	if got := reasoning.String(); got != "我们被要求" {
 		t.Fatalf("first-choice-only client would miss reasoning tokens: got %q body=%s", got, rec.Body.String())
 	}
 	if got := content.String(); got != "答案" {
 		t.Fatalf("first-choice-only client would miss content tokens: got %q body=%s", got, rec.Body.String())
 	}
 }
 func TestHandleStreamCoalescesSmallContentDeltas(t *testing.T) {
 	h := &Handler{}
 	lines := make([]string, 0, 101)
 	for i := 0; i < 100; i++ {
 		b, _ := json.Marshal(map[string]any{
 			"p": "response/content",
 			"v": "字",
 		})
 		lines = append(lines, "data: "+string(b))
 	}
 	lines = append(lines, "data: [DONE]")
 	resp := makeSSEHTTPResponse(lines...)
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil)
 	h.handleStream(rec, req, resp, "cid-coalesce", "deepseek-v4-flash", "prompt", 0, false, false, nil, nil, nil)
 	frames, done := parseSSEDataFrames(t, rec.Body.String())
 	if !done {
 		t.Fatalf("expected [DONE], body=%s", rec.Body.String())
 	}
 	var content strings.Builder
 	contentDeltaFrames := 0
 	for _, frame := range frames {
 		choices, _ := frame["choices"].([]any)
 		if len(choices) != 1 {
 			t.Fatalf("expected exactly one choice per stream frame, got %d frame=%#v body=%s", len(choices), frame, rec.Body.String())
 		}
 		choice, _ := choices[0].(map[string]any)
 		delta, _ := choice["delta"].(map[string]any)
 		if c, ok := delta["content"].(string); ok {
 			contentDeltaFrames++
 			content.WriteString(c)
 		}
 	}
 	if got, want := content.String(), strings.Repeat("字", 100); got != want {
 		t.Fatalf("coalesced stream content mismatch: got %q want %q body=%s", got, want, rec.Body.String())
 	}
 	if contentDeltaFrames >= 100 {
 		t.Fatalf("expected coalescing to reduce 100 tiny content frames, got %d body=%s", contentDeltaFrames, rec.Body.String())
 	}
 }
 func TestHandleStreamIncompleteCapturedToolJSONFlushesAsTextOnFinalize(t *testing.T) {
 	h := &Handler{}
 	resp := makeSSEHTTPResponse(
@@ -248,7 +361,7 @@ func TestHandleStreamIncompleteCapturedToolJSONFlushesAsTextOnFinalize(t *testin
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil)
-	h.handleStream(rec, req, resp, "cid10", "deepseek-v4-flash", "prompt", false, false, []string{"search"}, nil, nil)
+	h.handleStream(rec, req, resp, "cid10", "deepseek-v4-flash", "prompt", 0, false, false, []string{"search"}, nil, nil)
 	frames, done := parseSSEDataFrames(t, rec.Body.String())
 	if !done {
@@ -282,7 +395,7 @@ func TestHandleStreamPromotesThinkingToolCallsOnFinalizeWithoutMidstreamIntercep
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil)
-	h.handleStream(rec, req, resp, "cid-thinking-stream", "deepseek-v4-pro", "prompt", true, false, []string{"search"}, nil, nil)
+	h.handleStream(rec, req, resp, "cid-thinking-stream", "deepseek-v4-pro", "prompt", 0, true, false, []string{"search"}, nil, nil)
 	frames, done := parseSSEDataFrames(t, rec.Body.String())
 	if !done {
@@ -291,20 +404,16 @@ func TestHandleStreamPromotesThinkingToolCallsOnFinalizeWithoutMidstreamIntercep
 	if !streamHasToolCallsDelta(frames) {
 		t.Fatalf("expected tool_calls delta from finalize fallback, body=%s", rec.Body.String())
 	}
-	reasoningSeen := false
 	for _, frame := range frames {
 		choices, _ := frame["choices"].([]any)
 		for _, item := range choices {
 			choice, _ := item.(map[string]any)
 			delta, _ := choice["delta"].(map[string]any)
 			if asString(delta["reasoning_content"]) != "" {
-				reasoningSeen = true
+				t.Fatalf("did not expect leaked reasoning_content markup, body=%s", rec.Body.String())
 			}
 		}
 	}
-	if !reasoningSeen {
-		t.Fatalf("expected reasoning_content to stream before finalize fallback, body=%s", rec.Body.String())
-	}
 	if streamFinishReason(frames) != "tool_calls" {
 		t.Fatalf("expected finish_reason=tool_calls, body=%s", rec.Body.String())
 	}
@@ -319,7 +428,7 @@ func TestHandleStreamPromotesHiddenThinkingDSMLToolCallsOnFinalize(t *testing.T)
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil)
-	h.handleStream(rec, req, resp, "cid-hidden-thinking-stream", "deepseek-v4-pro", "prompt", false, false, []string{"search"}, nil, nil)
+	h.handleStream(rec, req, resp, "cid-hidden-thinking-stream", "deepseek-v4-pro", "prompt", 0, false, false, []string{"search"}, nil, nil)
 	frames, done := parseSSEDataFrames(t, rec.Body.String())
 	if !done {
@@ -353,7 +462,7 @@ func TestHandleStreamEmitsDistinctToolCallIDsAcrossSeparateToolBlocks(t *testing
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil)
-	h.handleStream(rec, req, resp, "cid-multi", "deepseek-v4-flash", "prompt", false, false, []string{"read_file", "search"}, nil, nil)
+	h.handleStream(rec, req, resp, "cid-multi", "deepseek-v4-flash", "prompt", 0, false, false, []string{"read_file", "search"}, nil, nil)
 	frames, done := parseSSEDataFrames(t, rec.Body.String())
 	if !done {
@@ -419,7 +528,7 @@ func TestHandleStreamCoercesSchemaDeclaredStringArgumentsOnFinalize(t *testing.T
 		},
 	}
-	h.handleStream(rec, req, resp, "cid-string-protect", "deepseek-v4-flash", "prompt", false, false, []string{"Write"}, toolsRaw, nil)
+	h.handleStream(rec, req, resp, "cid-string-protect", "deepseek-v4-flash", "prompt", 0, false, false, []string{"Write"}, toolsRaw, nil)
 	frames, done := parseSSEDataFrames(t, rec.Body.String())
 	if !done {
@@ -451,3 +560,45 @@ func TestHandleStreamCoercesSchemaDeclaredStringArgumentsOnFinalize(t *testing.T
 	}
 	t.Fatalf("expected at least one streamed tool call delta, body=%s", rec.Body.String())
 }
 func TestHandleNonStreamWithRetryIncludesRefFileTokensInUsage(t *testing.T) {
 	h := &Handler{}
 	run := func(refFileTokens int) map[string]any {
 		resp := makeSSEHTTPResponse(
 			`data: {"p":"response/content","v":"hello world"}`,
 			`data: [DONE]`,
 		)
 		rec := httptest.NewRecorder()
 		h.handleNonStreamWithRetry(rec, context.Background(), nil, resp, nil, "", "cid-ref", "deepseek-v4-flash", "prompt", refFileTokens, false, false, nil, nil, nil)
 		if rec.Code != http.StatusOK {
 			t.Fatalf("expected 200, got %d body=%s", rec.Code, rec.Body.String())
 		}
 		return decodeJSONBody(t, rec.Body.String())
 	}
 	base := run(0)
 	withRef := run(7)
 	baseUsage, _ := base["usage"].(map[string]any)
 	refUsage, _ := withRef["usage"].(map[string]any)
 	if baseUsage == nil || refUsage == nil {
 		t.Fatalf("expected usage objects, base=%#v ref=%#v", base["usage"], withRef["usage"])
 	}
 	getInt := func(m map[string]any, key string) int {
 		t.Helper()
 		v, ok := m[key].(float64)
 		if !ok {
 			t.Fatalf("expected numeric %s, got %#v", key, m[key])
 		}
 		return int(v)
 	}
 	if got := getInt(refUsage, "prompt_tokens") - getInt(baseUsage, "prompt_tokens"); got != 7 {
 		t.Fatalf("expected prompt_tokens delta 7, got %d", got)
 	}
 	if got := getInt(refUsage, "total_tokens") - getInt(baseUsage, "total_tokens"); got != 7 {
 		t.Fatalf("expected total_tokens delta 7, got %d", got)
 	}
 }
--- a/internal/httpapi/openai/chat/ref_file_tokens.go
+++ b/internal/httpapi/openai/chat/ref_file_tokens.go
@@ -0,0 +1,26 @@
 package chat
 // addRefFileTokensToUsage adds inline-uploaded file token estimates to an existing
 // usage map inside a response object. This keeps the token accounting aware of file
 // content that the upstream model processes but that is not part of the prompt text.
 func addRefFileTokensToUsage(obj map[string]any, refFileTokens int) {
 	if refFileTokens <= 0 || obj == nil {
 		return
 	}
 	usage, ok := obj["usage"].(map[string]any)
 	if !ok || usage == nil {
 		return
 	}
 	for _, key := range []string{"input_tokens", "prompt_tokens"} {
 		if v, ok := usage[key]; ok {
 			if n, ok := v.(int); ok {
 				usage[key] = n + refFileTokens
 			}
 		}
 	}
 	if v, ok := usage["total_tokens"]; ok {
 		if n, ok := v.(int); ok {
 			usage["total_tokens"] = n + refFileTokens
 		}
 	}
 }
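A minimal usage sketch for the helper above. The function body is copied from the diff so the example compiles on its own; the `builtInProcess` and `decodedFromJSON` helpers and the sample numbers are hypothetical. The diff does not show how the usage map is built, so this assumes the counters are plain Go `int` values when the helper runs. The second case shows a property of Go type assertions worth keeping in mind: `encoding/json` decodes numbers in a `map[string]any` as `float64`, so the `v.(int)` assertion does not match and such a map is left unchanged.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// addRefFileTokensToUsage is copied from the diff so this sketch is self-contained.
func addRefFileTokensToUsage(obj map[string]any, refFileTokens int) {
	if refFileTokens <= 0 || obj == nil {
		return
	}
	usage, ok := obj["usage"].(map[string]any)
	if !ok || usage == nil {
		return
	}
	for _, key := range []string{"input_tokens", "prompt_tokens"} {
		if v, ok := usage[key]; ok {
			if n, ok := v.(int); ok {
				usage[key] = n + refFileTokens
			}
		}
	}
	if v, ok := usage["total_tokens"]; ok {
		if n, ok := v.(int); ok {
			usage["total_tokens"] = n + refFileTokens
		}
	}
}

// builtInProcess returns a response object whose counters are Go ints
// (hypothetical sample data, assumed shape).
func builtInProcess() map[string]any {
	return map[string]any{
		"usage": map[string]any{"prompt_tokens": 10, "completion_tokens": 2, "total_tokens": 12},
	}
}

// decodedFromJSON returns the same object after a JSON decode,
// where every number is a float64.
func decodedFromJSON() map[string]any {
	var obj map[string]any
	_ = json.Unmarshal([]byte(`{"usage":{"prompt_tokens":10,"completion_tokens":2,"total_tokens":12}}`), &obj)
	return obj
}

func main() {
	a := builtInProcess()
	addRefFileTokensToUsage(a, 7)
	fmt.Println(a["usage"]) // prompt_tokens and total_tokens each grow by 7

	b := decodedFromJSON()
	addRefFileTokensToUsage(b, 7)
	fmt.Println(b["usage"]) // unchanged: float64 values do not match v.(int)
}
```

This is consistent with the test above reading `float64` values: it inspects the response after it has been serialized and decoded again, whereas the helper runs on the in-memory map before serialization.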
--- a/internal/httpapi/openai/chat/test_helpers_test.go
+++ b/internal/httpapi/openai/chat/test_helpers_test.go
@@ -12,25 +12,18 @@ import (
 type mockOpenAIConfig struct {
 	aliases             map[string]string
 	wideInput           bool
 	autoDeleteMode      string
 	toolMode            string
 	earlyEmit           string
 	responsesTTL        int
 	embedProv           string
 	historySplitEnabled bool
 	historySplitTurns   int
 	currentInputEnabled bool
 	currentInputMin     int
 	thinkingInjection   *bool
 	thinkingPrompt      string
 }
-func (m mockOpenAIConfig) ModelAliases() map[string]string { return m.aliases }
+func (m mockOpenAIConfig) ModelAliases() map[string]string     { return m.aliases }
 func (m mockOpenAIConfig) CompatWideInputStrictOutput() bool {
 	return m.wideInput
 }
 func (m mockOpenAIConfig) CompatStripReferenceMarkers() bool   { return true }
 func (m mockOpenAIConfig) ToolcallMode() string                { return m.toolMode }
 func (m mockOpenAIConfig) ToolcallEarlyEmitConfidence() string { return m.earlyEmit }
 func (m mockOpenAIConfig) ResponsesStoreTTLSeconds() int       { return m.responsesTTL }
@@ -41,14 +34,7 @@ func (m mockOpenAIConfig) AutoDeleteMode() string {
 	}
 	return m.autoDeleteMode
 }
-func (m mockOpenAIConfig) AutoDeleteSessions() bool  { return false }
+func (m mockOpenAIConfig) AutoDeleteSessions() bool      { return false }
-func (m mockOpenAIConfig) HistorySplitEnabled() bool { return m.historySplitEnabled }
-func (m mockOpenAIConfig) HistorySplitTriggerAfterTurns() int {
-	if m.historySplitTurns <= 0 {
-		return 1
-	}
-	return m.historySplitTurns
-}
 func (m mockOpenAIConfig) CurrentInputFileEnabled() bool { return m.currentInputEnabled }
 func (m mockOpenAIConfig) CurrentInputFileMinChars() int {
 	return m.currentInputMin