- Increase StreamIdleTimeout from 90s to 300s and MaxKeepaliveCount from 10 to 40
to prevent premature stream termination with DeepSeek V4 Pro (~50K token contexts)
- Add r.Context().Err() check after ConsumeSSE in empty_retry_runtime (chat + responses)
to prevent historySession.error() from overwriting historySession.stopped()
when the request context is cancelled
References:
- MaxKeepaliveCount=10 creates a 50s no-content timeout that kills the stream
before DeepSeek V4 Pro can produce its first token with large contexts
- Hermes Agent reports 'No response from provider for 180s' because the
underlying SSE connection was already terminated by ds2api at 50s
- Context cancellation path: OnContextDone -> stopped(), then finalize()
with empty output -> retry -> error() overwrites stopped()
/admin/chat-history, /admin/proxies, /admin/dev/raw-samples,
and /admin/dev/captures were falling through to the SPA fallback
(/admin/index.html), causing "Unexpected token '<'" JSON parse
errors on the frontend.
Replace bufio.Scanner with bufio.NewReaderSize + ReadBytes('\n') across all
SSE read paths to preserve long single-line data (e.g. write_file content).
Add quasi_status and auto_continue handling as direct path-based patches in
both Go continue observer and Node vercel_stream_impl, mirroring existing
batch-patch logic. Add 2MiB+ line throughput tests at every SSE layer.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace hardcoded DSML typo variant lists in Go/Node tool call parsers with
generalized prefix consumption that tolerates repeated leading <, repeated DSML
prefix noise, and trailing pipe terminators. Split tiktoken-dependent token
counting into a build-tagged file for non-cgo platform compatibility. Add /data
directory to Dockerfile for bind-mount permissions.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Track byte sizes of inline-uploaded files during PreprocessInlineFileInputs and convert them to conservative token estimates (bytes/3). RefFileTokens is threaded through StandardRequest into all OpenAI chat/responses usage builders so returned prompt_tokens/input_tokens reflect the full upstream context cost including attached files.
PromptTokenText now reflects the actual downstream context cost: the uploaded IGNORE.txt file content plus the neutral live prompt, instead of only the pre-split prompt text.
Lock in the current_input_file regression with API-level tests and document that returned context token counts now track full prompt semantics with conservative sizing.
Unify Claude count_tokens, legacy stream accounting, and legacy render usage with preserved prompt text so Claude stops falling back to lossy message formatting.
Use the stored full-context prompt text for chat non-stream, stream, and retry accounting so current_input_file no longer shrinks returned prompt token counts.
Move OpenAI chat and responses usage accounting onto the shared tokenizer-aware counters so prompt and output usage stay model-aware and conservatively sized.