Observability
- Metrics: latency, token usage, error rates, open WebSocket connections
- Logs: structured JSON with correlation IDs
- Tracing: edge traces and spans across chat → tools → storage
- Alerts: error budgets, quota breaches, degraded regions
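
As an illustration of the structured JSON logs with correlation IDs listed above, here is a minimal sketch. The field names (`ts`, `level`, `msg`, `correlationId`), the `x-correlation-id` header, and both function names are assumptions made for the example; the source does not specify a log schema.

```typescript
// Hypothetical log schema: the source only says "structured JSON with correlation IDs".
type LogLevel = "debug" | "info" | "warn" | "error";

interface LogEntry {
  ts: string;
  level: LogLevel;
  msg: string;
  correlationId: string;
  [key: string]: unknown;
}

// Serialize one log line. Core fields are written last so that
// caller-supplied fields cannot overwrite them.
function formatLog(
  level: LogLevel,
  msg: string,
  correlationId: string,
  fields: Record<string, unknown> = {},
  now: Date = new Date(),
): string {
  const entry: LogEntry = {
    ...fields,
    ts: now.toISOString(),
    level,
    msg,
    correlationId,
  };
  return JSON.stringify(entry);
}

// Reuse an incoming ID when present so one request can be followed across
// chat → tools → storage; otherwise generate a new one.
// Assumes header names have already been lower-cased by the caller.
function correlationIdFrom(
  headers: Record<string, string | undefined>,
  generate: () => string,
): string {
  return headers["x-correlation-id"] || generate();
}
```

A caller would typically resolve the ID once at the edge and pass it to every downstream log call, for example `formatLog("info", "chat.start", id, { region: "iad" })`.
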
SLOs and SLAs
| Service | SLO | Notes |
|---|---|---|
| Chat Streaming | p99 < 150 ms per step | Measured at edge SSE |
| WebSocket Broadcast | p99 < 30 ms | Durable Object affinity |
| Auth Validation | p99 < 50 ms | Token cache enabled |
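
To show how a window of latency samples could be checked against the p99 targets in the table, here is a rough sketch. It uses the nearest-rank percentile definition, which is one common choice and an assumption here; the source does not say how p99 is computed. The names `percentile`, `meetsSlo`, and `SLO_P99_MS` are hypothetical.

```typescript
// p99 targets in milliseconds, copied from the table above.
const SLO_P99_MS = {
  chatStreaming: 150,
  webSocketBroadcast: 30,
  authValidation: 50,
} as const;

type Service = keyof typeof SLO_P99_MS;

// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  // Multiply before dividing to keep the rank computation in integers
  // for whole-number percentiles.
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1] as number;
}

// True when the observed p99 is strictly below the target for the service.
function meetsSlo(service: Service, latenciesMs: number[]): boolean {
  return percentile(latenciesMs, 99) < SLO_P99_MS[service];
}
```

With very small sample windows the nearest-rank p99 is simply the maximum, so a check like this is only meaningful over a window large enough to contain a tail.
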
Health Checks
- /health verifies DB, integrations, and secrets
- Synthetic probes per region
- Canary deploy with automatic rollback
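
The `/health` behaviour described above could be sketched as an aggregator that runs each dependency check with a timeout and reports 503 if any of them fails. The check names, the 2000 ms default timeout, and the response shape are assumptions for illustration, not the actual endpoint contract.

```typescript
// A check resolves when the dependency is healthy and rejects otherwise.
type Check = () => Promise<void>;

interface CheckResult {
  name: string;
  ok: boolean;
  detail?: string;
}

interface HealthReport {
  status: number; // 200 when every check passes, 503 otherwise
  body: { ok: boolean; checks: CheckResult[] };
}

// Reject if the check has not settled within `ms` milliseconds.
function withTimeout(check: Check, ms: number): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    check().then(
      () => { clearTimeout(timer); resolve(); },
      (err: unknown) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Run all checks concurrently so one slow dependency does not delay the rest.
async function runHealthChecks(
  checks: Record<string, Check>,
  timeoutMs = 2000, // assumed default
): Promise<HealthReport> {
  const results = await Promise.all(
    Object.entries(checks).map(async ([name, check]): Promise<CheckResult> => {
      try {
        await withTimeout(check, timeoutMs);
        return { name, ok: true };
      } catch (err) {
        // Error messages should describe the failure only, never secret values.
        const detail = err instanceof Error ? err.message : String(err);
        return { name, ok: false, detail };
      }
    }),
  );
  const ok = results.every((r) => r.ok);
  return { status: ok ? 200 : 503, body: { ok, checks: results } };
}
```

A synthetic probe or canary controller could then treat any non-200 status as a failed check, with the per-check `detail` used for diagnosis.
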
Runbooks
- Stripe webhook retries
- Supabase outage fallback (read-only)
- Edge region failover
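
For the read-only fallback runbook, one possible shape for the request guard is sketched below: while a read-only flag is set, safe HTTP methods pass through and writes are refused with 503 and a retry hint. The flag, the function name `guardRequest`, and the 30-second retry value are hypothetical; the source does not describe how the fallback is implemented.

```typescript
type Decision =
  | { allowed: true }
  | { allowed: false; status: 503; retryAfterSeconds: number };

// Methods that do not modify state and can be served during an outage.
const SAFE_METHODS = new Set(["GET", "HEAD", "OPTIONS"]);

// Decide whether a request may proceed while the datastore is read-only.
function guardRequest(
  method: string,
  readOnlyMode: boolean,
  retryAfterSeconds = 30, // assumed value for the Retry-After hint
): Decision {
  if (!readOnlyMode || SAFE_METHODS.has(method.toUpperCase())) {
    return { allowed: true };
  }
  return { allowed: false, status: 503, retryAfterSeconds };
}
```

Placing the decision in one pure function keeps the runbook step small: toggling the flag is the only action needed to enter or leave read-only mode.
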