Troubleshooting
Slow replies: find the bottleneck
When a buyer says "the bot is slow", the first question is which stage is slow. A visitor turn crosses four network round-trips before the first character appears, and they fail independently:
| Stage | What it is | Typical | Fix when slow |
|---|---|---|---|
| embed_ms | Embedding the visitor's question (provider API call) | 80–250ms | Provider region; retrieve cache absorbs repeats |
| ann_ms | Vector similarity search (Vectorize or Qdrant) | 80–300ms remote / <10ms local Qdrant | Run Qdrant on the app server (VECTOR_PROVIDER=qdrant) |
| rerank_ms | Cross-encoder reranking of candidates | 120–400ms | Lower RAG_RERANK_FAN_OUT; accept ANN order |
| first_token_ms | Wait from turn start until the LLM's first streamed token | 200–900ms | Switch to a faster chat model (biggest single lever) |
The dashboard
Super-admin → /settings/system/hotpath-latency.
Shows the last 100 visitor turns with one column per stage,
p50 / p95 / max aggregate cards, and a plain-English verdict line
naming the dominant cost and its remedy. Colour coding: green
below 300ms, amber below 800ms, red at 800ms and above.
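As a rough illustration of what the aggregate cards and colour bands compute, here is a minimal Python sketch. The thresholds come from the description above; the nearest-rank percentile method and the sample values are assumptions, and the dashboard may well interpolate differently.

```python
import math

# Colour bands from the dashboard description.
GREEN_BELOW_MS = 300
AMBER_BELOW_MS = 800


def colour(ms: float) -> str:
    """Map a latency in milliseconds to the dashboard colour band."""
    if ms < GREEN_BELOW_MS:
        return "green"
    if ms < AMBER_BELOW_MS:
        return "amber"
    return "red"


def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile (assumed method, not confirmed by the source)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def aggregate(samples: list) -> dict:
    """Build the p50 / p95 / max card values for one stage."""
    return {
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
        "max": max(samples),
    }


# Made-up first_token_ms values for ten turns.
first_token_ms = [210, 340, 280, 950, 400, 310, 290, 620, 305, 330]
summary = aggregate(first_token_ms)
print(summary, {name: colour(value) for name, value in summary.items()})
```

A single slow turn is enough to push p95 and max into red while p50 stays amber, which is why the three cards are worth reading together.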
Data comes from a cache-backed ring buffer that the stream handler appends to after the reply finishes streaming, so visitors pay zero latency for the bookkeeping. The buffer keeps at most 100 turns, and entries expire after 7 days, whichever limit is hit first. No database table is involved, so it works identically on the Redis and database cache drivers.
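The retention rule can be pictured with a small in-memory sketch. This is not the application's code: the real buffer lives in the cache driver, and the class and method names here (`TurnBuffer`, `append`, `recent`) are hypothetical. Only the two limits, 100 turns and 7 days, come from the text above.

```python
import time
from collections import deque
from typing import Optional

MAX_TURNS = 100                  # from the docs: last 100 turns
TTL_SECONDS = 7 * 24 * 60 * 60   # from the docs: 7 days


class TurnBuffer:
    """In-memory stand-in for the cache-backed ring buffer (illustrative only)."""

    def __init__(self, max_turns: int = MAX_TURNS, ttl: int = TTL_SECONDS):
        # deque(maxlen=...) silently drops the oldest entry once full.
        self._turns = deque(maxlen=max_turns)
        self._ttl = ttl

    def append(self, timings: dict, now: Optional[float] = None) -> None:
        """Record one turn's stage timings with a timestamp."""
        stamp = time.time() if now is None else now
        self._turns.append((stamp, timings))

    def recent(self, now: Optional[float] = None) -> list:
        """Return timings that are still inside the retention window."""
        current = time.time() if now is None else now
        return [t for stamp, t in self._turns if current - stamp <= self._ttl]


buf = TurnBuffer()
for i in range(150):
    buf.append({"turn": i, "first_token_ms": 300 + i}, now=1000.0)
print(len(buf.recent(now=1000.0)))  # only the newest 100 remain
```

Appending happens after the reply has streamed, so a structure like this sits entirely off the visitor's critical path.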
Reading the table
- First token high, retrieval low → the LLM is the bottleneck. Open Settings → System → AI providers and pick a faster model (the dropdown shows expected TTFT per model).
- Retrieval high → check which sub-column grew.
  - Search high → vector store round-trip; running Qdrant locally takes it under 10ms.
  - Rerank high → reduce fan-out.
  - Embed high → provider region.
- "retrieve cache hit" rows → repeat questions skip embed/search/rerank entirely (30-minute cache). These rows show what your pipeline costs when retrieval is free.
- Everything green but visitors still complain → the problem is between the visitor's browser and your server: proxy buffering (see the SSE heartbeat notes in Architecture → Hot path), TLS setup time, or plain geography.
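The reading rules above can be written down as a small decision function. This is a sketch of the reasoning, not the dashboard's verdict logic: the 300ms cut-off for "high" (the edge of the green band), the rule order, and the function name `triage` are all assumptions.

```python
# Assumption: a stage counts as "high" once its p50 leaves the green band.
HIGH_MS = 300

# Remedies paraphrased from the list above, keyed by stage name.
RETRIEVAL_REMEDIES = {
    "embed_ms": "Embed high: check the embedding provider region",
    "ann_ms": "Search high: vector store round-trip, consider local Qdrant",
    "rerank_ms": "Rerank high: reduce fan-out",
}


def triage(p50: dict) -> str:
    """Return a one-line verdict for a dict of per-stage p50 values in ms."""
    slow = {stage: p50[stage] for stage in RETRIEVAL_REMEDIES if p50[stage] >= HIGH_MS}
    if slow:
        # Retrieval is high: name the sub-stage that costs the most.
        worst = max(slow, key=slow.get)
        return RETRIEVAL_REMEDIES[worst]
    if p50["first_token_ms"] >= HIGH_MS:
        return "First token high, retrieval low: the LLM is the bottleneck"
    return "Everything green: look between the visitor's browser and your server"


print(triage({"embed_ms": 110, "ann_ms": 90, "rerank_ms": 150, "first_token_ms": 850}))
print(triage({"embed_ms": 110, "ann_ms": 420, "rerank_ms": 310, "first_token_ms": 500}))
print(triage({"embed_ms": 90, "ann_ms": 8, "rerank_ms": 130, "first_token_ms": 240}))
```

Retrieval is checked first because first_token_ms is measured from turn start, so a slow retrieval stage may inflate it as well.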
CLI: perf:hotpath
Same measurement, scriptable. Run after any change you hope made things faster (model switch, Qdrant migration, cache driver) and compare runs:
```bash
php artisan perf:hotpath               # 10 turns, first published agent
php artisan perf:hotpath --turns=25    # bigger sample
php artisan perf:hotpath --agent=<id>  # specific agent
```
The command sends real widget turns (same JWT + SSE path the embedded widget uses), prints wall-clock TTFB per turn, the same per-stage p50/p95 table as the dashboard, and the same verdict line. A unique suffix per message defeats the 30-minute retrieve cache so every turn exercises the full pipeline. When no provider keys are configured it warns that FakeOpenAi is bound; those runs measure pipeline overhead only.
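To see why a unique suffix defeats the retrieve cache, consider the sketch below. It assumes the cache is keyed on a hash of the normalised question text; that keying scheme, and the helper names `cache_key` and `with_unique_suffix`, are hypothetical and not taken from the application.

```python
import hashlib
import uuid


def cache_key(question: str) -> str:
    """Hypothetical retrieve-cache key: SHA-256 of the normalised question."""
    normalised = " ".join(question.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def with_unique_suffix(question: str) -> str:
    """Append a random token so no two benchmark messages are identical."""
    return f"{question} [{uuid.uuid4().hex[:8]}]"


base = "What are your opening hours?"
first = with_unique_suffix(base)
second = with_unique_suffix(base)

# The same question always maps to the same key, so it would hit the cache...
print(cache_key(base) == cache_key("  what are your OPENING hours? "))
# ...while suffixed messages map to fresh keys and miss it.
print(cache_key(first) != cache_key(base), cache_key(first) != cache_key(second))
```

Under that assumption, each benchmark turn pays for embed, search, and rerank, which is what makes runs comparable before and after a change.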
Raw logs
Every turn also writes a structured rag.turn line to
the Laravel log with the same stage breakdown plus
retrieve_timings. For historical analysis beyond the
100-turn buffer:
```bash
grep 'rag.turn' storage/logs/laravel.log | tail -50
```
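For analysis beyond eyeballing the grep output, the lines can be parsed and summarised. The sketch below assumes each rag.turn entry follows Laravel's default line format, with the message followed by a JSON context object, and that the context uses the stage names from the table (embed_ms, ann_ms, and so on). Both are assumptions, and the sample lines are invented for illustration.

```python
import json
import re
import statistics

# Assumed shape: "... rag.turn {json context}" on a single line.
LINE_RE = re.compile(r"rag\.turn\s+(\{.*\})")


def parse_turns(lines) -> list:
    """Extract the JSON context from every rag.turn line, skipping the rest."""
    turns = []
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        try:
            turns.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            continue  # malformed or truncated line
    return turns


def stage_values(turns: list, stage: str) -> list:
    """Collect one stage's timings from the turns that report it."""
    return [turn[stage] for turn in turns if stage in turn]


# Invented sample lines; real field names may differ.
sample = [
    '[2024-05-01 10:00:00] production.INFO: rag.turn {"embed_ms":120,"ann_ms":95,"first_token_ms":480}',
    '[2024-05-01 10:00:07] production.INFO: rag.turn {"embed_ms":200,"ann_ms":310,"first_token_ms":910}',
    '[2024-05-01 10:00:09] production.ERROR: something unrelated',
]
turns = parse_turns(sample)
print(len(turns), statistics.median(stage_values(turns, "embed_ms")))
```

In practice the `sample` list would be replaced by the lines of storage/logs/laravel.log, for example by iterating over the open file.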