
Checks-to-Owner Responder Map

Route each failing check to the fastest owner path.

| Check | Primary Owner | Secondary Owner | First Response |
| --- | --- | --- | --- |
| GET /health fails | Runtime On-Call | Platform On-Call | Verify process/pod status, restart if needed |
| GET /v1/models fails/auth errors | Auth Runtime On-Call | Platform On-Call | Validate API key, provider auth files, refresh path |
| GET /v1/metrics/providers shows one provider degraded | Platform On-Call | Provider Integrations | Shift traffic to fallback prefix/provider |
| GET /v0/management/config returns 404 | Platform On-Call | Runtime On-Call | Enable remote-management.secret-key, restart |
| POST /v0/management/auths/{provider}/refresh fails | Auth Runtime On-Call | Provider Integrations | Validate management key, rerun provider auth login |
| Logs show sustained 429 | Platform On-Call | Capacity Owner | Reduce concurrency, add credentials/capacity |
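
For teams that want this map in machine-readable form (for example, to attach owner names to alerts), the rows above could be encoded as a small lookup table. This is a minimal sketch, not part of any existing tooling: the `Route` type, the `ROUTES` dictionary, the `route_check` function, and the short check identifiers used as keys are all hypothetical names chosen for illustration. Only the owner names and first-response text come from the map itself.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Route:
    """One row of the responder map."""
    primary: str
    secondary: str
    first_response: str


# Keys are hypothetical identifiers; values mirror the responder map rows.
ROUTES: Dict[str, Route] = {
    "health_fails": Route(
        "Runtime On-Call", "Platform On-Call",
        "Verify process/pod status, restart if needed"),
    "models_fails_or_auth_errors": Route(
        "Auth Runtime On-Call", "Platform On-Call",
        "Validate API key, provider auth files, refresh path"),
    "provider_degraded": Route(
        "Platform On-Call", "Provider Integrations",
        "Shift traffic to fallback prefix/provider"),
    "management_config_404": Route(
        "Platform On-Call", "Runtime On-Call",
        "Enable remote-management.secret-key, restart"),
    "auth_refresh_fails": Route(
        "Auth Runtime On-Call", "Provider Integrations",
        "Validate management key, rerun provider auth login"),
    "sustained_429": Route(
        "Platform On-Call", "Capacity Owner",
        "Reduce concurrency, add credentials/capacity"),
}


def route_check(check_id: str) -> Optional[Route]:
    """Return the owner route for a failing check, or None if unmapped."""
    return ROUTES.get(check_id)
```

Calling `route_check("sustained_429")` returns the route whose primary owner is Platform On-Call. An unmapped identifier returns `None`, so the caller can decide how to handle checks that are not in the map.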

Paging Guidelines

  1. Page primary owner immediately when critical user traffic is impacted.
  2. Add secondary owner if no mitigation within 10 minutes.
  3. Escalate to the incident lead when two or more critical checks fail together.
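
The three guidelines above can be sketched as a single decision function. This is an illustrative sketch under stated assumptions, not an existing paging integration: the function name `who_to_page` and its parameters are hypothetical, and treating "within 10 minutes" as "10 or more minutes elapsed without mitigation" is an interpretation of guideline 2.

```python
from typing import List

SECONDARY_DELAY_MINUTES = 10   # guideline 2: no mitigation within 10 minutes
MULTI_FAILURE_THRESHOLD = 2    # guideline 3: two or more critical checks


def who_to_page(
    critical_traffic_impacted: bool,
    mitigated: bool,
    minutes_elapsed: float,
    critical_checks_failing: int,
) -> List[str]:
    """Return the roles to page, in guideline order."""
    targets: List[str] = []
    # Guideline 1: page the primary owner as soon as critical traffic is hit.
    if critical_traffic_impacted:
        targets.append("primary owner")
    # Guideline 2: add the secondary owner once the window passes unmitigated.
    # Assumption: the boundary is inclusive (exactly 10 minutes triggers it).
    if not mitigated and minutes_elapsed >= SECONDARY_DELAY_MINUTES:
        targets.append("secondary owner")
    # Guideline 3: escalate when multiple critical checks fail together.
    if critical_checks_failing >= MULTI_FAILURE_THRESHOLD:
        targets.append("incident lead")
    return targets
```

For example, an unmitigated incident that has impacted critical traffic for 12 minutes with two critical checks failing yields all three roles. The same incident at minute 3 with a single failing check yields only the primary owner.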

Last reviewed: 2026-02-21
Owner: Incident Commander Rotation
Pattern: YYYY-MM-DD

MIT Licensed