Provider Outage Triage Quick Guide

Use this quick guide when a provider starts failing or latency spikes.

5-Minute Flow

Confirm process health:
- curl -sS -f http://localhost:8317/health
Confirm exposed models still look normal:
- curl -sS http://localhost:8317/v1/models -H "Authorization: Bearer <api-key>" | jq '.data | length'
Inspect provider metrics for the failing provider:
- curl -sS http://localhost:8317/v1/metrics/providers | jq
Check logs for repeated status codes (401, 403, 429, 5xx).
Reroute critical traffic to fallback prefix/provider.

Decision Hints

Symptom	Likely Cause	Immediate Action
One provider has high error ratio, others healthy	Upstream outage/degradation	Shift traffic to fallback provider prefix
Mostly `401/403`	Expired/invalid provider auth	Run auth refresh checks and manual refresh
Mostly `429`	Upstream throttling	Lower concurrency and shift non-critical traffic
`/v1/models` missing expected models	Provider config/auth problem	Recheck provider block, auth file, and filters

Escalation Trigger

Escalate after 10 minutes if any one is true:

No successful requests for a critical workload.
Error ratio remains above on-call threshold after reroute.
Two independent providers are simultaneously degraded.

Last reviewed: 2026-02-21
Owner: Platform On-Call
Pattern: YYYY-MM-DD