Troubleshooting
Widget Monitor: errors & provider failover
The Widget Monitor at
/settings/system/widget-monitor (super-admin only) is the
error counterpart to the hot-path
latency dashboard. Latency answers “why is the bot slow?”; the
Widget Monitor answers “what's breaking?” — stream failures,
LLM-provider outages, and visitor-reported freezes, all in one feed you can
triage and mark resolved as you fix each one.
What it tracks
Each row is one failure or anomaly on the visitor path:
- Provider down — every configured LLM provider failed for a turn. This is the one to act on first: a real outage your visitors felt.
- Stream failed — an unhandled exception killed a reply mid-flight. Open it to read the exception class and trace the cause.
- Failover — the primary provider stumbled but a backup picked the turn up. Informational: the visitor saw nothing wrong, but a rising count means the primary is unhealthy.
- Client freeze — the visitor's browser saw the stream stall (no data for 35s, or the 120s ceiling) and gave up. Reported by the widget itself, so you catch freezes even when the server looks fine.
- Tool-loop timeout / Retrieval failed — slower-path failures in tool calling or knowledge retrieval.
Filter by time window (24h / 7d / 30d), type, severity, and open-vs-resolved. The banner at the top gives a plain-English verdict (“provider outage in progress…”, “healthy with hiccups…”, “all clear…”) so you can read the state at a glance. Click Resolve on a row once you've dealt with it — or Reopen if it comes back.
Provider failover — why a single outage no longer kills every chat
The reliability backbone behind the monitor is runtime provider failover. Pitchbar can be configured with more than one LLM provider (Cloudflare Workers AI, OpenAI, OpenRouter). When two or more are configured, every visitor turn runs through a failover chain: if the primary provider is slow, returns a 5xx, is rate-limited (429), or has run out of credits, Pitchbar transparently retries the same turn on the next provider, before the visitor sees an error.
- Retryable failures (timeouts, 429s, 5xx, connection errors) trigger failover — the provider is the problem, so another may succeed.
- Client errors (a 4xx “bad request”) are not retried — every provider would reject them identically, so the error surfaces immediately instead of doubling the latency.
- Streaming only fails over before the first token. Once the visitor has seen text, a mid-stream drop can't be silently restarted on another provider — it surfaces as a stream failure.
Configure at least one backup provider (OPENAI_API_KEY or OPENROUTER_API_KEY alongside your Cloudflare credentials) so a single outage becomes invisible to visitors instead of fatal.
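The failover rules above can be sketched in a few lines. This is an illustrative sketch, not Pitchbar's implementation: ProviderError, is_retryable, and stream_with_failover are hypothetical names, and each provider is modelled as a plain generator function that yields tokens. Credit exhaustion is left out because the way each provider signals it is not stated here.

```python
from typing import Callable, Iterator, List, Optional


class ProviderError(Exception):
    """Hypothetical error type carrying an optional HTTP status code."""

    def __init__(self, message: str, status: Optional[int] = None):
        super().__init__(message)
        self.status = status


def is_retryable(err: Exception) -> bool:
    """Return True when another provider might succeed where this one failed."""
    # Timeouts and connection errors: the provider is the problem.
    if isinstance(err, (TimeoutError, ConnectionError)):
        return True
    # Rate limits (429) and server errors (5xx) are retryable;
    # other 4xx client errors would be rejected by every provider.
    if isinstance(err, ProviderError) and err.status is not None:
        return err.status == 429 or 500 <= err.status <= 599
    return False


Provider = Callable[[str], Iterator[str]]


def stream_with_failover(providers: List[Provider], prompt: str) -> Iterator[str]:
    """Yield tokens for one turn, failing over only before the first token."""
    last_error: Optional[Exception] = None
    for provider in providers:
        emitted = False
        try:
            for token in provider(prompt):
                emitted = True
                yield token
            return  # the turn completed on this provider
        except Exception as err:
            # Once text has been shown, or for a non-retryable error,
            # surface the failure instead of restarting elsewhere.
            if emitted or not is_retryable(err):
                raise
            last_error = err
    raise ProviderError("every configured provider failed") from last_error
```

In this sketch, the final ProviderError corresponds to a "Provider down" row, an exception raised after the first token corresponds to "Stream failed", and a turn that completes on a later provider corresponds to "Failover".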
Fail-fast timeouts
Blocking provider calls (tool decisions, embeddings) are capped well below the streaming budget, so a hung provider is abandoned quickly and failover reaches the next one fast instead of waiting out a full minute. Streaming keeps a generous budget because a legitimate long answer holds the connection open; a 5-second connect cap still detects a dead host within seconds.
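One way to picture the split between blocking and streaming budgets is shown below. Only the 5-second connect cap comes from the text; the read budgets (10s and 60s) and the names TimeoutBudget, budget_for, and worst_case_failover_delay are assumptions chosen for illustration, not Pitchbar's actual settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeoutBudget:
    connect_s: float  # time allowed to establish the connection
    read_s: float     # time allowed to wait for response data


CONNECT_CAP_S = 5.0  # stated in the text

# Assumed values for illustration only.
BLOCKING = TimeoutBudget(CONNECT_CAP_S, read_s=10.0)
STREAMING = TimeoutBudget(CONNECT_CAP_S, read_s=60.0)


def budget_for(call_kind: str) -> TimeoutBudget:
    """Pick a budget by call type; the call_kind labels are hypothetical."""
    if call_kind in ("tool_decision", "embedding"):
        return BLOCKING
    if call_kind == "stream":
        return STREAMING
    raise ValueError(f"unknown call kind: {call_kind}")


def worst_case_failover_delay(failed_providers: int, budget: TimeoutBudget) -> float:
    """Upper bound on time spent on hung providers before the next one is tried."""
    return failed_providers * (budget.connect_s + budget.read_s)
```

With these assumed numbers, one hung provider costs a blocking call at most 15 seconds before failover, where the streaming budget would allow up to 65 seconds.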
Tuning the freeze detector
“Client freeze” events come from the widget's own stall detector. If you see them clustered, cross-reference the hot-path latency monitor — a freeze is usually a turn that genuinely took longer than the widget's patience (35s without a single byte), which points at provider responsiveness or an overloaded tool loop rather than a hard crash.
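The two thresholds behind a "Client freeze" event can be modelled as below. This is a sketch of the logic only, not the widget's code: the 35s and 120s values come from the text, while the StallDetector class, its method names, and the choice to treat the limits as inclusive and to measure the ceiling from the start of the turn are assumptions.

```python
from typing import Optional

IDLE_LIMIT_S = 35.0    # no data for this long counts as a stall
TOTAL_LIMIT_S = 120.0  # ceiling for the turn (assumed to run from its start)


class StallDetector:
    """Tracks one streamed reply using caller-supplied timestamps in seconds."""

    def __init__(self, started_at: float):
        self.started_at = started_at
        self.last_data_at = started_at

    def on_data(self, now: float) -> None:
        """Record that a chunk arrived at time `now`."""
        self.last_data_at = now

    def check(self, now: float) -> Optional[str]:
        """Return 'ceiling', 'idle', or None if the stream still looks alive."""
        if now - self.started_at >= TOTAL_LIMIT_S:
            return "ceiling"
        if now - self.last_data_at >= IDLE_LIMIT_S:
            return "idle"
        return None
```

An "idle" result means no chunk arrived for 35 seconds, which matches the case described above: the turn outlasted the widget's patience without sending a single byte.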