Provider Outage Triage Quick Guide

Use this quick guide when a provider starts failing or latency spikes.

5-Minute Flow

  1. Confirm process health:
    • curl -sS -f http://localhost:8317/health
  2. Confirm exposed models still look normal:
    • curl -sS http://localhost:8317/v1/models -H "Authorization: Bearer <api-key>" | jq '.data | length'
  3. Inspect provider metrics for the failing provider:
    • curl -sS http://localhost:8317/v1/metrics/providers | jq .
  4. Check logs for repeated status codes (401, 403, 429, 5xx).
  5. Reroute critical traffic to the fallback provider prefix.
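
Step 4 has no command of its own, so here is a minimal sketch of one way to tally the status codes it names. It assumes each log line carries the upstream status as `status=NNN`, which this guide does not specify; the function name `count_status_codes` is illustrative, and the pattern will need adjusting to your real log format.

```shell
# Tally 401, 403, 429, and 5xx responses from log lines on stdin.
# ASSUMPTION: each log line contains the status as "status=NNN".
# count_status_codes is a hypothetical helper, not part of any tool.
count_status_codes() {
  grep -oE 'status=(401|403|429|5[0-9]{2})' \
    | sort \
    | uniq -c \
    | sort -rn \
    | awk '{ print $2, $1 }'   # print "status=NNN count", most frequent first
}
```

Pipe a recent slice of the log into it, for example the output of `tail -n 5000` on your log file, and read the top line as the dominant failure mode.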

Decision Hints

Symptom | Likely Cause | Immediate Action
One provider has high error ratio, others healthy | Upstream outage/degradation | Shift traffic to fallback provider prefix
Mostly 401/403 | Expired/invalid provider auth | Run auth refresh checks and manual refresh
Mostly 429 | Upstream throttling | Lower concurrency and shift non-critical traffic
/v1/models missing expected models | Provider config/auth problem | Recheck provider block, auth file, and filters
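
The status-code rows of the table above can be sketched as a small lookup, which may help when scripting alerts or runbook prompts. The function name `suggest_action` is hypothetical, and mapping 5xx codes to the "upstream outage" row is an assumption; the table itself keys that row on error ratio, not on a specific code.

```shell
# Map the dominant status code to the immediate action from the table.
# suggest_action is a hypothetical helper; the 5xx mapping is an assumption.
suggest_action() {
  case "$1" in
    401|403)     echo "Run auth refresh checks and manual refresh" ;;
    429)         echo "Lower concurrency and shift non-critical traffic" ;;
    5[0-9][0-9]) echo "Shift traffic to fallback provider prefix" ;;
    *)           echo "No matching table entry; continue triage" ;;
  esac
}
```

For example, `suggest_action 429` prints the throttling action.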

Escalation Trigger

Escalate after 10 minutes if any of the following is true:

  • No successful requests for a critical workload.
  • Error ratio remains above the on-call threshold after reroute.
  • Two independent providers are simultaneously degraded.
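
The escalation rule above can be expressed as a single check. This is a sketch under stated assumptions: the function name and all five parameters are illustrative, and because the guide does not give a value for the on-call threshold, the caller passes it in.

```shell
# Return 0 (escalate) if at least 10 minutes have elapsed and any trigger holds.
# All names are hypothetical. Arguments, in order:
#   $1 minutes since triage started
#   $2 successful requests for the critical workload
#   $3 current error ratio after reroute (e.g. 0.20)
#   $4 on-call threshold for the error ratio (value is your team's choice)
#   $5 number of independent providers currently degraded
should_escalate() {
  [ "$1" -ge 10 ] || return 1
  # Trigger 1: no successful requests for a critical workload.
  [ "$2" -eq 0 ] && return 0
  # Trigger 2: error ratio still above threshold (awk handles decimals).
  awk -v r="$3" -v t="$4" 'BEGIN { exit !(r > t) }' && return 0
  # Trigger 3: two or more independent providers degraded at once.
  [ "$5" -ge 2 ] && return 0
  return 1
}
```

For example, `should_escalate 12 0 0.01 0.05 0 && echo "page the escalation contact"` fires because the critical workload has no successes.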

Last reviewed: 2026-02-21
Owner: Platform On-Call
Pattern: YYYY-MM-DD
