Troubleshooting

Widget Monitor: errors & provider failover

The Widget Monitor at /settings/system/widget-monitor (super-admin only) is the error counterpart to the hot-path latency dashboard. Latency answers “why is the bot slow?”; the Widget Monitor answers “what's breaking?” — stream failures, LLM-provider outages, and visitor-reported freezes, all in one feed you can triage and mark resolved as you fix each one.

Recorded off the hot path

Every event is written by a queued job, never by a synchronous database write inside the live stream. Capturing a failure therefore never slows a working chat, and the first-token latency contract is untouched.
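As a rough illustration of the pattern described above, the sketch below uses Python's standard-library queue as a stand-in for a real job queue. The names record_event, worker, and stored_events are hypothetical and are not Pitchbar's actual code. The point is only that the streaming path enqueues and returns, while the slow write happens elsewhere.

```python
import queue
import threading

# Hypothetical in-process stand-ins for a real job queue and a database table.
event_queue = queue.Queue()
stored_events = []


def record_event(event_type, detail):
    """Called from the streaming path: enqueue the event and return at once."""
    event_queue.put_nowait({"type": event_type, "detail": detail})


def worker():
    """Runs outside the streaming path and performs the slow write."""
    while True:
        event = event_queue.get()
        if event is None:  # sentinel value: shut the worker down
            event_queue.task_done()
            break
        stored_events.append(event)  # stands in for the database insert
        event_queue.task_done()


def run_demo():
    """Record two events from the 'hot path' and let the worker persist them."""
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    record_event("stream_failed", "RuntimeError")
    record_event("failover", "primary timed out")
    event_queue.put(None)  # ask the worker to stop once the queue is drained
    thread.join()
    return list(stored_events)
```

Calling run_demo() returns the two events in the order they were recorded. In a real deployment the queue would be a durable job backend rather than an in-memory object.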

What it tracks

Each row is one failure or anomaly on the visitor path:

Provider down — every configured LLM provider failed for a turn. This is the one to act on first: a real outage your visitors felt.
Stream failed — an unhandled exception killed a reply mid-flight. Open it to read the exception class and trace the cause. The event is stamped with the provider that was actually serving the turn (e.g. cloudflare, cloudflare-fallback, openai), so you see the culprit on the failure itself instead of cross-referencing a separate Provider down row. The same provider tag is written to the widget.llm_failed / widget.stream_unhandled log lines and the conversation debugger's error trace.
Failover — the primary provider stumbled but a backup picked the turn up. Informational: the visitor saw nothing wrong, but a rising count means the primary is unhealthy.
Client freeze — the visitor's browser saw the stream stall (no data for 35s, or the 120s ceiling) and gave up. Reported by the widget itself, so you catch freezes even when the server looks fine.
Tool-loop timeout / Retrieval failed — slower-path failures in tool calling or knowledge retrieval.

Filter by time window (24h / 7d / 30d), type, severity, and open-vs-resolved. The banner at the top gives a plain-English verdict (“provider outage in progress…”, “healthy with hiccups…”, “all clear…”) so you can read the state at a glance. Click Resolve on a row once you've dealt with it — or Reopen if it comes back.
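The filters above amount to a conjunction of simple predicates over each row. The following is a minimal sketch, assuming a hypothetical MonitorEvent record. The field names and the type and severity strings are illustrative only and are not taken from Pitchbar's schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# The three time windows named in the text.
WINDOWS = {
    "24h": timedelta(hours=24),
    "7d": timedelta(days=7),
    "30d": timedelta(days=30),
}


@dataclass
class MonitorEvent:
    type: str              # hypothetical label, e.g. "provider_down"
    severity: str          # hypothetical label, e.g. "critical"
    occurred_at: datetime
    resolved: bool = False


def filter_events(
    events: List[MonitorEvent],
    now: datetime,
    window: str = "24h",
    event_type: Optional[str] = None,
    severity: Optional[str] = None,
    resolved: Optional[bool] = None,
) -> List[MonitorEvent]:
    """Keep events inside the window that match every filter that was given."""
    cutoff = now - WINDOWS[window]
    return [
        e
        for e in events
        if e.occurred_at >= cutoff
        and (event_type is None or e.type == event_type)
        and (severity is None or e.severity == severity)
        and (resolved is None or e.resolved == resolved)
    ]
```

Passing None for a filter means "do not filter on this field", which mirrors leaving a dropdown on its default value.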

Provider failover — why a single outage no longer kills every chat

The reliability backbone behind the monitor is runtime provider failover. Pitchbar can be configured with more than one LLM provider (Cloudflare Workers AI, OpenAI, OpenRouter). When two or more are configured, every visitor turn runs through a failover chain: if the primary provider is slow, returns a 5xx, is rate-limited (429), or has run out of credits, Pitchbar transparently retries on the next provider within the same turn, before the visitor sees an error.

Retryable failures (timeouts, 429s, 5xx, connection errors) trigger failover — the provider is the problem, so another may succeed.
Client errors (a 4xx such as a 400 “bad request”, but not the rate-limit 429) are not retried, because every provider would reject them identically; the error surfaces immediately instead of doubling the latency.
Streaming only fails over before the first token. Once the visitor has seen text, a mid-stream drop can't be silently restarted on another provider — it surfaces as a stream failure.
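The three rules above can be sketched as a small generator. This is an illustration under stated assumptions, not Pitchbar's implementation: the exception classes, the stream_with_failover function, and the shape of a provider (a callable that yields tokens) are all hypothetical.

```python
from typing import Callable, Iterator, Sequence, Tuple


class RetryableError(Exception):
    """Timeout, 429, 5xx, or connection error: another provider may succeed."""


class ClientError(Exception):
    """Non-retryable 4xx: every provider would reject the same request."""


class StreamFailed(Exception):
    """The stream broke after the visitor had already seen text."""


# Hypothetical provider shape: takes a prompt, yields tokens as they arrive.
Provider = Callable[[str], Iterator[str]]


def stream_with_failover(
    providers: Sequence[Tuple[str, Provider]], prompt: str
) -> Iterator[str]:
    """Yield tokens from the first provider that works, in configured order."""
    last_error = None
    for name, provider in providers:
        sent_any = False
        try:
            for token in provider(prompt):
                sent_any = True
                yield token
            return  # the provider finished the reply normally
        except RetryableError as exc:
            if sent_any:
                # Text is already on screen, so a silent restart is impossible.
                raise StreamFailed(f"{name} dropped mid-stream") from exc
            last_error = exc  # nothing shown yet: try the next provider
        # ClientError is deliberately not caught, so it surfaces immediately.
    raise RuntimeError("every configured provider failed") from last_error
```

With a list of one provider, the loop has nowhere to go after the first retryable failure, which is the single-provider outage case described in the next section.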

Configure a backup provider

Failover only helps if there's somewhere to fail over to. With a single provider configured, an outage still kills every turn — and the monitor will say so (“provider outage with NO failover”). Add a second provider's key in System Settings (an OPENAI_API_KEY or OPENROUTER_API_KEY alongside your Cloudflare credentials) so a single outage becomes invisible to visitors instead of fatal.

Fail-fast timeouts

Blocking provider calls (tool decisions, embeddings) are capped well below the streaming budget, so a hung provider is abandoned quickly and failover reaches the next one without waiting out a full minute. Streaming keeps a generous budget because a legitimate long answer holds the connection open; a 5-second connect cap still detects a dead host within seconds.
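To make the trade-off concrete, here is a small arithmetic sketch. Only the 5-second connect cap comes from the text above. The 60-second streaming budget and the 10-second blocking cap are assumed values chosen for illustration, and TimeoutBudget and worst_case_wait are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeoutBudget:
    connect_seconds: float  # time allowed to establish the connection
    total_seconds: float    # time allowed for the whole call


# The 5-second connect cap is from the text; the totals are assumptions.
STREAMING = TimeoutBudget(connect_seconds=5.0, total_seconds=60.0)
BLOCKING = TimeoutBudget(connect_seconds=5.0, total_seconds=10.0)


def worst_case_wait(budget: TimeoutBudget, hung_providers: int) -> float:
    """Seconds lost if each hung provider is held until its total cap."""
    return budget.total_seconds * hung_providers
```

Under these assumed numbers, two hung providers cost a blocking call 20 seconds before the third is tried, where the streaming budget would have cost 120 seconds.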

Tuning the freeze detector

“Client freeze” events come from the widget's own stall detector. If you see them clustered, cross-reference the hot-path latency dashboard: a freeze is usually a turn that genuinely took longer than the widget's patience (35s without a single byte), which points at provider responsiveness or an overloaded tool loop rather than a hard crash.
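The stall check itself can be modelled with two timers. The sketch below is written in Python for illustration, although the real detector runs in the browser widget. The 35s and 120s limits come from this page. The StallDetector class, its method names, and the assumption that the 120s ceiling is measured from the start of the stream are hypothetical.

```python
from typing import Optional


class StallDetector:
    """Tracks one stream and reports why, if at all, it counts as frozen."""

    def __init__(self, started_at: float,
                 idle_limit: float = 35.0, total_limit: float = 120.0):
        self.started_at = started_at
        self.last_data_at = started_at
        self.idle_limit = idle_limit      # seconds allowed without any data
        self.total_limit = total_limit    # assumed: seconds since stream start

    def on_data(self, now: float) -> None:
        """Call whenever a chunk arrives; resets the idle timer only."""
        self.last_data_at = now

    def check(self, now: float) -> Optional[str]:
        """Return 'ceiling', 'idle', or None if the stream is still healthy."""
        if now - self.started_at >= self.total_limit:
            return "ceiling"
        if now - self.last_data_at >= self.idle_limit:
            return "idle"
        return None
```

A stream that keeps trickling data never trips the idle timer, so only the overall ceiling can end it. A stream that goes silent is reported after 35 seconds regardless of how long it has been running.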