19. Provider usage, prompt-cache observability, and WebSocket continuation

Copy Markdown View Source

Date: 2026-06-09 Status: Accepted Implementation status: Provider usage/cache deterministic slice implemented; WebSocket probe surface implemented; production WebSocket transport in progress

Context

Pixir sends stateless OpenAI Responses calls with store: false and folds the local Log into Provider input on every Turn. That keeps the Log authoritative, but it also means cost, latency, and prompt-cache behavior are easy to misunderstand unless the Harness records the Provider's accounting.

Prompt cache is not a Session store. It is an optimization on stable prompt prefixes. The only trustworthy evidence that cache was used is Provider usage metadata such as cached_tokens in the final response.completed usage payload. Latency or subjective "it felt faster" observations are not enough.

The public OpenAI API supports automatic prompt caching for long prompts, prompt_cache_key as a routing hint for requests that share a stable prefix, and prompt_cache_retention for retention policy. Pixir's primary path is the ChatGPT/Codex subscription backend, so new request parameters must be treated as dialect features: send only fields that are known to be accepted or have a safe fallback.

OpenAI's Responses WebSocket mode is a separate optimization. It can keep a live connection and continue from a prior Response with previous_response_id while sending only the new input. The important distinction is:

  • Prompt cache is Provider-side prefix reuse and is proven by cached_tokens.
  • WebSocket continuation is transport/session-on-the-socket reuse and is proven by successful same-socket continuation semantics.
  • Neither mechanism is Pixir's durable source of truth. The local Log remains authoritative.

The older Pixir repo already carried a bounded WebSocket smoke that empirically probed the ChatGPT/Codex subscription path. That implementation is useful donor material for Pixir Harness. It should be ported as an opt-in smoke first so the product can safely move toward a WebSocket-first default transport without confusing validation probes with ordinary CI or local unit tests.

Decision

Pixir records Provider usage as a canonical provider_usage Event for every Provider call in a Turn, including repeated calls inside a tool loop. The event is durable Harness evidence and is excluded from Provider replay.

Pixir.Provider.stream/2 captures usage from the streamed response.completed event and returns both:

  • the raw usage object;
  • a normalized summary with input tokens, cached tokens, output tokens, reasoning tokens, total tokens, and cache-hit rate.

Normal Turn calls populate a deterministic prompt_cache_key when the Provider dialect accepts it. The key is a cache family, not an individual request id. It is short, bounded, and non-PII; it may include stable hashes for model, mode, Session/fork family, Tool set, and Skill index. It must not include raw workspace paths, user text, timestamps, request ids, emails, or secrets.

Pixir improves cache friendliness by keeping Tool specs and Skill index rendering deterministic. The current workspace path remains in early instructions for now because it is part of Pixir's existing agent contract; this is a known cache-rate drag across different workspaces and should be revisited only with a separate prompt-contract change.

Pixir keeps store: false. Extended prompt-cache retention (prompt_cache_retention: "24h") is not sent by default on the ChatGPT/Codex subscription path until a probe proves that backend accepts it. If enabled on a supported API path, it remains a request-level optimization and still does not make the Provider the source of truth.

Pixir adds an opt-in WebSocket smoke surface:

mix pixir.smoke.websocket --dry-run --json
mix pixir.smoke.websocket --json

The smoke is agent-useful: it has --help, --dry-run, JSON output, structured errors, bounded evidence, and no token/account-id printing. The real-network path probes:

  • WebSocket upgrade/handshake;
  • a minimal response.create;
  • same-socket previous_response_id continuation;
  • a tool-call loop continued with function_call_output;
  • reconnect behavior showing that store: false response ids are not durable cross-socket Session storage.

Pixir's Provider transport decision is WebSocket-first by default once support is proven for the active credential/dialect path. HTTP/SSE remains the fallback and debug path. The production transport should be configured as auto | websocket | http_sse:

  • auto is the default and prefers WebSocket when the dialect and credential path support it, then falls back to HTTP/SSE with a visible reason;
  • websocket requires WebSocket support and fails honestly when unavailable;
  • http_sse keeps the current stateless behavior.

The auto fallback must be a recovery policy, not a one-way downgrade. If WebSocket fails because the socket cannot be opened, closes unexpectedly, or emits malformed frames, Pixir may complete the Turn through HTTP/SSE so the user is not blocked. It then marks WebSocket as temporarily degraded and retries WebSocket on a later Turn or after backoff. If WebSocket recovers, auto returns to WebSocket. Provider/model failures such as response.failed are not transport failures and must not be hidden by fallback.

prompt_cache_key survives transport fallback. It is a routing hint for a stable prompt prefix family, so a full HTTP/SSE replay after WebSocket failure should carry the same safe cache-key family when the active dialect supports it. previous_response_id does not have the same portability: it is connection/Provider continuation state and may be discarded on fallback or reconnect without corrupting Pixir, because the local Log can always rebuild the Provider input.

Real-network runtime evidence on the ChatGPT/Codex subscription path showed the nuance: two cache-eligible WebSocket requests with the same prompt_cache_key but separate connections may still return cached_tokens: 0, while the same pair on one live WebSocket connection observed a cache hit. Therefore Pixir treats prompt_cache_key as worth preserving across transports, but not as a guarantee. WebSocket-first remains the better default because it preserves engine/socket affinity when the Session is linear.

Production WebSocket support should be implemented as a supervised Provider connection owned by a Session or Subagent. It may keep connection-local continuation state such as the latest Response id, but it must not replace canonical Events, local Log replay, compaction, or orphan tool-call repair.

Consequences

  • Pixir can answer "did this run hit prompt cache?" from durable evidence rather than guesses.
  • Subagent and Workflow cost/cache behavior becomes inspectable because every model call gets its own usage event.
  • Provider usage increases Log volume slightly, but the event is compact and string-keyed.
  • Replay remains clean because provider_usage is never converted into model input.
  • Cache keys can improve routing for fork-like workloads without exposing private paths or prompt content.
  • Sending prompt_cache_retention too aggressively could break the ChatGPT/Codex backend; keeping it gated avoids turning an optimization into a product reliability risk.
  • WebSocket continuation can reduce repeated payload transfer and improve latency for long, tool-heavy sessions, and is the right default for a Codex-first harness when the backend supports it. It still introduces connection lifecycle, reconnect, and fallback state that must be supervised explicitly.
  • HTTP/SSE fallback preserves reliability and should preserve the same safe prompt_cache_key, but fallback may lose WebSocket-only continuation state. This is acceptable because replay comes from Pixir's Log.
  • Parallel Subagents should use separate WebSocket connections. Responses WebSocket mode is not a general multiplexing layer for many in-flight responses on one socket.
  • Fork-like workloads can benefit from both cache-key routing and shared-prefix WebSocket continuation, but forks are different from handoffs: forks preserve a shared prefix and diverge late; handoffs summarize or transfer context.

Non-goals

  • Do not treat the WebSocket smoke task as the production transport. Runtime WebSocket support must live behind Provider transport policy and supervised connection modules.
  • Do not implement full Session fork UX in this slice.
  • Do not use Provider cache as Session persistence.
  • Do not use WebSocket previous_response_id as Session persistence.
  • Do not claim cache hits from latency alone.
  • Do not claim WebSocket performance wins without measuring bytes sent and latency.
  • Do not expose raw prompts, raw paths, or identities in cache keys.
  • Do not make network prompt-cache or WebSocket smokes part of ordinary tests or CI.

Verification Direction

The minimal contract is:

mix test test/pixir/provider_test.exs
mix test test/pixir/turn_test.exs
mix test test/pixir/event_test.exs
mix test test/pixir/skills_test.exs
mix pixir.smoke.prompt_cache --dry-run --json
mix pixir.smoke.websocket --dry-run --json
mix check

Regression coverage should prove:

  • provider_usage is canonical and string-keyed.
  • Provider replay excludes provider_usage.
  • response.completed usage is captured and normalized for current and legacy shapes.
  • Turn records one provider_usage Event per Provider call.
  • Default cache keys are stable, bounded, and do not include raw workspace paths or user text.
  • Tool specs and Skill index output are deterministic.
  • The prompt-cache smoke task exposes --help, --dry-run, --json, and structured errors.
  • The WebSocket smoke task exposes --help, --dry-run, --json, structured errors, and a no-auth/no-network/no-write dry-run contract.
  • The WebSocket real-network path remains opt-in and records enough evidence to distinguish handshake failure, provider error, same-socket continuation failure, tool loop failure, and reconnect/durable-state mismatch.

Representative real-network probes should use gpt-5.5 with low reasoning effort unless the user explicitly asks otherwise, because recent cache-routing experiments were standardized on that model family.

References