
Provider Operations Runbook

This runbook is for operators who care about provider uptime, quota health, and routing quality.

Daily Checks

  1. Health check:
    • curl -sS http://localhost:8317/health
  2. Model inventory:
    • curl -sS http://localhost:8317/v1/models -H "Authorization: Bearer <api-key>" | jq '.data | length'
  3. Provider metrics:
    • curl -sS http://localhost:8317/v1/metrics/providers | jq
  4. Log scan:
    • Verify there are no sustained bursts of 401, 403, or 429 (see the log-scan sketch after this list).
  5. Spark eligibility check (Copilot/Codex):
    • curl -sS http://localhost:8317/v1/models -H "Authorization: Bearer <api-key>" | jq -r '.data[].id' | rg 'gpt-5.3-codex|gpt-5.3-codex-spark'
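
A minimal log-scan sketch for step 4, assuming request logs land in logs/*.log (the path used elsewhere in this runbook) and that HTTP status codes appear verbatim in log lines:

bash
# Count lines mentioning 401/403/429 per log file; sustained growth points at auth or quota trouble.
rg -c '\b(401|403|429)\b' logs/*.log
# Optional breakdown per status code.
rg -o '\b(401|403|429)\b' logs/*.log | sort | uniq -c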

Quota Visibility (#146 scope)

  • Current operational source of truth:
    • /v1/metrics/providers
    • Management auth snapshots (/v0/management/auth-files)
    • Kiro quota snapshot endpoint: /v0/management/kiro-quota (includes remaining_quota, usage_percentage, quota_exhausted)
  • Treat repeated 429 + falling success ratio as quota pressure and rotate capacity accordingly.
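
A hedged jq sketch of that quota-pressure check; the per-provider field names (requests, status_429/errors_429) are assumptions here, so adjust them to the actual /v1/metrics/providers schema in your deployment:

bash
# Flag providers whose rate-limit share suggests quota pressure (field names are assumptions).
curl -sS http://localhost:8317/v1/metrics/providers | jq '
  to_entries[]
  | {provider: .key,
     requests: (.value.requests // 0),
     rate_limited: (.value.status_429 // .value.errors_429 // 0)}
  | select(.requests > 0)
  | . + {rate_limited_ratio: (.rate_limited / .requests)}'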

Kiro Remaining Quota Probe

bash
AUTH_KEY="replace-with-management-secret"
curl -sS http://localhost:8317/v0/management/kiro-quota \
  -H "Authorization: Bearer $AUTH_KEY" | jq

If multiple Kiro credentials exist, map and query by index:

bash
curl -sS http://localhost:8317/v0/management/auth-files \
  -H "Authorization: Bearer $AUTH_KEY" \
  | jq -r '.[] | .auth_index // .index'

curl -sS "http://localhost:8317/v0/management/kiro-quota?auth_index=<auth-index>" \
  -H "Authorization: Bearer $AUTH_KEY" | jq

Suggested alert policy:

  • Warn: any credential returns quota_exhausted=true.
  • Warn: 429 ratio > 5% over 10 minutes.
  • Critical: 429 ratio > 10% over 10 minutes OR steady quota_exhausted=true across top 2 providers.
  • Action: enable fallback toggles and rotate to alternate credentials:
    • quota-exceeded.switch-project=true
    • quota-exceeded.switch-preview-model=true
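
A sketch of the quota_exhausted warning check above, assuming every auth index returned by /v0/management/auth-files can be queried against the Kiro quota endpoint (non-Kiro credentials may simply return nothing useful):

bash
AUTH_KEY="replace-with-management-secret"
# Warn on any credential reporting quota_exhausted=true (field name from the Kiro quota snapshot above).
for idx in $(curl -sS http://localhost:8317/v0/management/auth-files \
               -H "Authorization: Bearer $AUTH_KEY" | jq -r '.[] | .auth_index // .index'); do
  exhausted=$(curl -sS "http://localhost:8317/v0/management/kiro-quota?auth_index=${idx}" \
                -H "Authorization: Bearer $AUTH_KEY" | jq -r '.quota_exhausted // false')
  [ "$exhausted" = "true" ] && echo "WARN: credential ${idx} reports quota_exhausted=true"
done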

Onboard a New Provider

  1. Add provider block in config.yaml (openai-compatibility preferred for OpenAI-style upstreams).
  2. Add prefix for tenant/workload isolation.
  3. Add model aliases for client-stable names.
  4. Validate /v1/models output includes expected IDs.
  5. Run a canary request through the new prefix (see the sketch after this list).
  6. Monitor /v1/metrics/providers for 10-15 minutes before sending production traffic.
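
A canary sketch for step 5; the prefix (newprov/) and model name are placeholders, and /v1/chat/completions is assumed to be the OpenAI-compatible entry point your clients use:

bash
# Replace newprov/canary-model with the real prefix and a model confirmed in /v1/models.
curl -sS http://localhost:8317/v1/chat/completions \
  -H "Authorization: Bearer <api-key>" \
  -H "Content-Type: application/json" \
  -d '{"model":"newprov/canary-model","messages":[{"role":"user","content":"ping"}],"stream":false}' \
  | jq '{model: .model, finish_reason: .choices[0].finish_reason, usage: .usage}'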

Rotation and Quota Strategy

  • Configure multiple credentials per provider where supported.
  • Keep at least one alternate provider for each critical workload class.
  • Use prefixes to separate high-priority traffic from best-effort traffic.
  • If one provider is degraded, reroute by updating model prefix policy and aliases.

Incident Playbooks

Repeated 401/403

  • Recheck credential validity and token freshness.
  • For OAuth providers (kiro, cursor, minimax, roo), verify token files and refresh path.
  • Confirm client is hitting intended provider prefix.

Repeated 429

  • Add capacity (extra keys/providers) or reduce concurrency.
  • Shift traffic to fallback provider prefix.
  • Tighten expensive-model exposure with excluded-models.

Wrong Provider Selected

  • Inspect force-model-prefix and model naming in requests.
  • Verify alias collisions across provider blocks.
  • Prefer explicit prefix/model calls for sensitive workloads.

Missing Models in /v1/models

  • Confirm provider block is enabled and auth loaded.
  • Check model filters (models, excluded-models) and prefix constraints.
  • Verify upstream provider currently serves requested model.
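
A sketch for spotting missing IDs quickly; expected-models.txt is a hypothetical file you maintain with one expected model ID per line:

bash
# Models listed in expected-models.txt but absent from /v1/models are printed by comm.
curl -sS http://localhost:8317/v1/models -H "Authorization: Bearer <api-key>" \
  | jq -r '.data[].id' | sort > /tmp/exposed-models.txt
comm -23 <(sort expected-models.txt) /tmp/exposed-models.txt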

Tool-Result Image Translation Regressions

  • Symptom pattern: tool responses containing image blocks fail after translation between OpenAI-compatible and Claude-style payloads.
  • First checks:
    • Reproduce with a non-stream request and compare with stream behavior.
    • Inspect request/response logs for payload-shape mismatches around tool_result + image content blocks.
  • Operational response:
    • Keep one canary scenario that includes image content in tool results.
    • Alert when canary success rate drops or 4xx translation errors spike for that scenario.
    • Route impacted traffic to a known-good provider prefix while triaging translator output.
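
A hedged canary sketch for the image-in-tool-result scenario; the /v1/messages path, the model placeholder, and the base64 payload are assumptions, so align them with the Claude-style endpoint and models actually exposed by your deployment:

bash
# Non-stream canary whose tool_result carries an image block; rerun with "stream": true to compare.
curl -sS http://localhost:8317/v1/messages \
  -H "Authorization: Bearer <api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<claude-style-model-id>",
    "max_tokens": 64,
    "messages": [
      {"role": "user", "content": "take a screenshot"},
      {"role": "assistant", "content": [{"type": "tool_use", "id": "toolu_canary", "name": "screenshot", "input": {}}]},
      {"role": "user", "content": [{"type": "tool_result", "tool_use_id": "toolu_canary", "content": [
        {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": "<base64-png>"}}
      ]}]}
    ]
  }' | jq '{stop_reason: .stop_reason, usage: .usage}'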

Stream/Non-Stream Usage Parity Check

  • Goal: confirm token usage fields are consistent between stream and non-stream responses for the same prompt.
  • Commands:
    • Non-stream:
      • curl -sS http://localhost:8317/v1/responses -H "Authorization: Bearer <api-key>" -H "Content-Type: application/json" -d '{"model":"gpt-5.1-codex","input":[{"role":"user","content":"ping"}],"stream":false}' | tee /tmp/nonstream.json | jq '{input_tokens: .usage.input_tokens, output_tokens: .usage.output_tokens, total_tokens: .usage.total_tokens}'
    • Stream (extract terminal usage event):
      • curl -sN http://localhost:8317/v1/responses -H "Authorization: Bearer <api-key>" -H "Content-Type: application/json" -d '{"model":"gpt-5.1-codex","input":[{"role":"user","content":"ping"}],"stream":true}' | rg '^data:' | sed 's/^data: //' | jq -c 'select(.usage? != null) | {input_tokens: (.usage.input_tokens // .usage.prompt_tokens), output_tokens: (.usage.output_tokens // .usage.completion_tokens), total_tokens: .usage.total_tokens}' | tail -n 1 | tee /tmp/stream-usage.json
    • Compare:
      • diff -u <(jq -S '{input_tokens: .usage.input_tokens, output_tokens: .usage.output_tokens, total_tokens: .usage.total_tokens}' /tmp/nonstream.json) <(jq -S . /tmp/stream-usage.json)
  • Pass criteria:
    • diff is empty, or any difference is explainable by provider-side truncation/stream interruption.

iFlow OAuth model visibility is narrower than expected

  • Symptom: login/auth succeeds, but only a subset of iflow/* models appear or work.
  • Immediate checks:
    • curl -sS http://localhost:8317/v1/models -H "Authorization: Bearer <api-key>" | jq -r '.data[].id' | rg '^iflow/'
    • Validate request model is exactly one of the exposed IDs.
  • Mitigation:
    • Do not assume upstream catalog parity after OAuth login.
    • Keep a known-good iFlow canary model and gate rollout on successful canary responses.

iFlow account errors shown in terminal

  • Symptom: terminal output shows account-level iFlow errors but requests keep retrying noisily.
  • Immediate checks:
    • rg -n "iflow|account|retry|cooldown|429|403" logs/*.log
    • curl -sS http://localhost:8317/v1/metrics/providers | jq '.iflow // .providers.iflow'
  • Mitigation:
    • Alert on sustained iFlow error-rate spikes (>5% over 10m).
    • Keep one known-good iFlow canary request in non-stream mode.
    • Rotate traffic away from iFlow prefix when account-level failures persist beyond cooldown windows.

Usage dashboard shows zeros under load

  • Symptom: traffic volume rises but usage counters remain 0.
  • Immediate checks:
    • Run one non-stream and one stream request against the same model and compare emitted usage fields/log lines.
    • Verify provider metrics endpoint still records request/error activity.
  • Mitigation:
    • Treat missing upstream usage as a provider payload gap, not a transport success signal.
    • Keep stream/non-stream parity probes in pre-release checks.

Antigravity / CLA CLI support matrix (CPB-0743)

  • Symptom: antigravity clients intermittently return empty payloads, or behavior differs between antigravity-cli and CLIProxyAPI Plus front-end calls.
  • Immediate checks:
    • Confirm model coverage:
      • curl -sS http://localhost:8317/v1/models -H "Authorization: Bearer <api-key>" | jq -r '.data[].id' | rg '^antigravity/'
    • Confirm supported CLI client class:
      • curl -sS http://localhost:8317/v0/management/config -H "Authorization: Bearer <management-secret>" | jq '.providers[] | select(.name=="antigravity") | .supported_clients'
    • Confirm request translation path in logs:
      • rg -n "antigravity|claude|tool_use|custom_model|request.*model" logs/*.log
  • Suggested matrix checks:
    • antigravity-cli should map to supported auth-backed model IDs.
    • Provider alias mode should keep aliases explicit in /v1/models.
    • Tool/callback-heavy workloads should pass through without dropping tool_use boundaries.
  • Mitigation:
    • If parity is missing, align source request to provider-native model IDs and re-check with a non-stream request first.
    • Route unsupported workloads through mapped aliases using ampcode.model-mappings and document temporary exclusion.
    • Keep a canary for each supported antigravity/* model with 10-minute trend windows.

Copilot Spark Mismatch (gpt-5.3-codex-spark)

  • Symptom: plus/team users get 400/404 model_not_found for gpt-5.3-codex-spark.
  • Immediate action:
    • Confirm presence in GET /v1/models for the exact client API key.
    • If absent, route workloads to gpt-5.3-codex and keep Spark disabled for that segment.
  • Suggested alert thresholds:
    • Warn: Spark error ratio > 2% over 10 minutes.
    • Critical: Spark error ratio > 5% over 10 minutes.
    • Auto-mitigation: fallback alias to gpt-5.3-codex when critical threshold is crossed.
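
A sketch of the per-key exposure check behind the immediate action above; the actual fallback (re-aliasing to gpt-5.3-codex) stays a config change outside this snippet:

bash
# Verify Spark exposure for one client key; treat absence as a routing decision, not an outage.
if curl -sS http://localhost:8317/v1/models -H "Authorization: Bearer <client-api-key>" \
     | jq -r '.data[].id' | rg -qx 'gpt-5.3-codex-spark'; then
  echo "gpt-5.3-codex-spark exposed for this key"
else
  echo "not exposed; route this segment to gpt-5.3-codex"
fi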

Codex 5.3 integration path (non-subprocess first)

  • Preferred path:
    • Embed via sdk/cliproxy when the caller owns the runtime process.
  • HTTP fallback path:
    • Use /v1/* only when crossing process boundaries.
  • Negotiation checks:
    • Probe /health and /v1/models before enabling codex5.3-specific flows.
    • Gate advanced behavior on observed model exposure (gpt-5.3-codex, gpt-5.3-codex-spark).
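
A minimal negotiation sketch for the HTTP fallback path, using only the endpoints already named above:

bash
# Probe health first, then gate codex-5.3 flows on observed model exposure.
curl -sfS http://localhost:8317/health > /dev/null || { echo "proxy unhealthy"; exit 1; }
models=$(curl -sS http://localhost:8317/v1/models -H "Authorization: Bearer <api-key>" | jq -r '.data[].id')
echo "$models" | rg -qx 'gpt-5.3-codex'       && echo "gpt-5.3-codex exposed"
echo "$models" | rg -qx 'gpt-5.3-codex-spark' && echo "gpt-5.3-codex-spark exposed"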

Amp traffic does not route through CLIProxyAPI

  • Symptom: Amp appears to call upstream directly and proxy logs remain idle.
  • Immediate checks:
    • Ensure Amp process has OPENAI_API_BASE=http://127.0.0.1:8317/v1.
    • Ensure Amp process has OPENAI_API_KEY=<client-key>.
    • Run one direct canary request with identical env and confirm it appears in proxy logs.
  • Mitigation:
    • Standardize Amp launch wrappers to export proxy env explicitly.
    • Add startup validation that fails early when base URL does not target CLIProxyAPI.
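
A launch-wrapper sketch for the mitigation above; the final exec line is a placeholder for however Amp is actually started in your environment:

bash
#!/usr/bin/env bash
# Export proxy env explicitly and fail early if the base URL does not target CLIProxyAPI.
export OPENAI_API_BASE="http://127.0.0.1:8317/v1"
export OPENAI_API_KEY="<client-key>"
case "$OPENAI_API_BASE" in
  http://127.0.0.1:8317/*) ;;
  *) echo "OPENAI_API_BASE does not target CLIProxyAPI: $OPENAI_API_BASE" >&2; exit 1 ;;
esac
exec amp "$@"   # placeholder: replace with the real Amp launch command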

Windows duplicate auth-file display safeguards

  • Symptom: auth records appear duplicated in management/UI surfaces on Windows.
  • Immediate checks:
    • Confirm auth filename normalization output is stable across refresh/reload cycles.
    • curl -sS http://localhost:8317/v0/management/auth-files -H "X-Management-Secret: <secret>" | jq '.[].filename' | sort | uniq -c
  • Rollout safety:
    • Gate deployments with one Windows canary that performs add -> refresh -> list -> restart -> list.
    • Block promotion when duplicate filename count changes after restart.
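
A sketch of that promotion gate, reusing the auth-files listing from the check above; the restart step depends on how the service is managed, so it stays a comment here:

bash
list_auth() {
  curl -sS http://localhost:8317/v0/management/auth-files \
    -H "X-Management-Secret: <secret>" | jq -r '.[].filename' | sort
}
list_auth > /tmp/auth-before.txt
# ... restart the service here (mechanism depends on the install) ...
list_auth > /tmp/auth-after.txt
diff -u /tmp/auth-before.txt /tmp/auth-after.txt && echo "no duplicate drift after restart"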

Metadata naming conventions for provider quota/refresh commands

Use consistent names across docs, APIs, and operator runbooks:

  • provider_key
  • model_id
  • quota_remaining
  • quota_reset_seconds
  • refresh_state

Avoid per-tool aliases for these fields in ops docs to keep telemetry queries deterministic.
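
A small illustrative query using the names above; the inline payload is made up purely to show that one field spelling keeps the jq expression identical across dashboards and docs:

bash
# Illustrative only: consistent field names mean this query never needs per-tool variants.
echo '[{"provider_key":"kiro","model_id":"claude-sonnet-4","quota_remaining":120,"quota_reset_seconds":3600,"refresh_state":"ok"}]' \
  | jq '.[] | select(.quota_remaining < 200) | {provider_key, model_id, quota_remaining, refresh_state}'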

TrueNAS Apprise notification DX checks

  • Validate target endpoint formatting before enabling alerts:
    • apprise -vv --dry-run -b "routing alert dry run" "<apprise-url>"
  • Send one canary alert for routing incidents:
    • apprise "<apprise-url>" -t "cliproxy canary" -b "provider routing notification check"
  • Keep this notification path non-blocking for request handling; alerts should not gate proxy response paths.

Gemini thinking-length control drift (OpenAI-compatible clients)

  • Symptom: client requests a specific thinking level/budget but observed behavior looks unbounded or unchanged.
  • Immediate checks:
    • Inspect request/response pair and compare with runtime debug lines:
      • thinking: original config from request
      • thinking: processed config to apply
    • Confirm requested model and its thinking-capable alias are exposed in /v1/models.
  • Suggested alert thresholds:
    • Warn: processed thinking mode mismatch ratio > 2% over 10 minutes.
    • Critical: processed thinking mode mismatch ratio > 5% over 10 minutes.
    • Warn: reasoning token growth > 25% above baseline for fixed-thinking workloads over 10 minutes.
  • Mitigation:
    • Force explicit thinking-capable model alias for affected workloads.
    • Reduce rollout blast radius by pinning the model suffix/level per workload class.
    • Keep one non-stream and one stream canary for each affected client integration.
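
A quick grep sketch against the runtime debug lines referenced above, assuming they land in logs/*.log:

bash
# Pull the most recent original-vs-processed thinking config pairs for spot comparison.
rg -n "thinking: original config|thinking: processed config" logs/*.log | tail -n 20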

Recommended Baseline Topology

  1. One direct primary provider for latency-critical traffic.
  2. One aggregator fallback provider for model breadth.
  3. Prefix-based routing policy per workload class.
  4. Metrics and alerting tied to error ratio, latency, and provider availability.
