Blengi docs

Troubleshooting

Slow replies: find the bottleneck

When a buyer says "the bot is slow", the first question is which stage is slow. A visitor turn crosses four network round-trips before the first character appears, and they fail independently:

| Stage | What it is | Typical | Fix when slow |
| --- | --- | --- | --- |
| embed_ms | Embedding the visitor's question (provider API call) | 80–250ms | Provider region; retrieve cache absorbs repeats |
| ann_ms | Vector similarity search (Vectorize or Qdrant) | 80–300ms remote / <10ms local Qdrant | Run Qdrant on the app server (VECTOR_PROVIDER=qdrant) |
| rerank_ms | Cross-encoder reranking of candidates | 120–400ms | Lower RAG_RERANK_FAN_OUT; accept ANN order |
| first_token_ms | Wait from turn start until the LLM's first streamed token | 200–900ms | Switch to a faster chat model (biggest single lever) |

The dashboard

Super-admin → /settings/system/hotpath-latency. Shows the last 100 visitor turns with one column per stage, p50 / p95 / max aggregate cards, and a plain-English verdict line naming the dominant cost with its remedy. Colour coding: green < 300ms, amber < 800ms, red above.
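The colour bands above can be expressed as a small shell helper, which may be handy when post-processing timings outside the dashboard. This is a minimal sketch: `latency_colour` is a hypothetical name, and treating exactly 800ms as red is an assumption, since the text only says "red above".

```shell
# latency_colour MS
# Map a stage latency in whole milliseconds to the dashboard's colour band.
# Thresholds come from the text: green < 300ms, amber < 800ms, red otherwise.
# Assumption: exactly 800ms counts as red.
latency_colour() {
  ms=$1
  if [ "$ms" -lt 300 ]; then
    echo green
  elif [ "$ms" -lt 800 ]; then
    echo amber
  else
    echo red
  fi
}
```

For example, `latency_colour 450` prints `amber`.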

Data comes from a cache-backed ring buffer that the stream handler appends to after the reply finishes streaming, so visitors pay zero latency for the bookkeeping. The buffer keeps entries for 7 days or 100 turns, whichever limit is reached first. No database table is involved, so it works identically on the Redis and database cache drivers.
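To illustrate the retention rule, here is a file-based analogue of a capped ring buffer. It is only a sketch of the idea, not Blengi's implementation: the real buffer lives in the cache, `buffer_append` is a hypothetical name, and applying the 7-day limit per entry (rather than to the buffer as a whole) is an assumption the text does not settle.

```shell
MAX_TURNS=100
MAX_AGE_SECONDS=604800   # 7 days

# buffer_append FILE NOW_EPOCH PAYLOAD
# Append one "epoch payload" line, then drop entries older than 7 days
# and keep only the newest 100 lines.
buffer_append() {
  file=$1
  now=$2
  payload=$3
  printf '%s %s\n' "$now" "$payload" >> "$file"
  # Keep lines whose timestamp is still inside the age window,
  # then cap the result at the newest MAX_TURNS entries.
  awk -v now="$now" -v max_age="$MAX_AGE_SECONDS" '($1 + max_age) >= now' "$file" \
    | tail -n "$MAX_TURNS" > "$file.tmp"
  mv "$file.tmp" "$file"
}
```

Because trimming happens on write, reads never need to filter, which matches the spirit of doing the bookkeeping after the reply has finished streaming.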

Reading the table

  • First token high, retrieval low: the LLM is the bottleneck. Open Settings → System → AI providers and pick a faster model (the dropdown shows expected TTFT per model).
  • Retrieval high: check which sub-column grew. Search high → vector store round-trip; running Qdrant locally takes it under 10ms. Rerank high → reduce fan-out. Embed high → provider region.
  • "retrieve cache hit" rows: repeat questions skip embed/search/rerank entirely (30-minute cache). These rows show what your pipeline costs when retrieval is free.
  • Everything green but visitors still complain: the problem is between the visitor's browser and your server: proxy buffering (see the SSE heartbeat notes in Architecture → Hot path), TLS setup time, or plain geography.
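The reading rules above can be condensed into a rough decision helper. This is a sketch under stated assumptions, not the dashboard's actual verdict logic (which is not documented here): `verdict` is a hypothetical name, the four values are treated as directly comparable whole milliseconds, and "largest value wins" is a deliberately naive rule.

```shell
# verdict EMBED_MS ANN_MS RERANK_MS FIRST_TOKEN_MS (whole milliseconds)
# Hypothetical helper that mirrors the reading rules in the text.
verdict() {
  embed=$1
  ann=$2
  rerank=$3
  first=$4

  # Every stage under the green threshold: the pipeline is not the problem.
  if [ "$embed" -lt 300 ] && [ "$ann" -lt 300 ] &&
     [ "$rerank" -lt 300 ] && [ "$first" -lt 300 ]; then
    echo "all green: look between browser and server (proxy buffering, TLS, geography)"
    return 0
  fi

  # Otherwise pick the largest stage; on a tie the stage checked first is kept.
  stage=first_token_ms
  worst=$first
  if [ "$embed" -gt "$worst" ]; then stage=embed_ms; worst=$embed; fi
  if [ "$ann" -gt "$worst" ]; then stage=ann_ms; worst=$ann; fi
  if [ "$rerank" -gt "$worst" ]; then stage=rerank_ms; worst=$rerank; fi

  # Remedies are taken from the stage table and the bullets above.
  case $stage in
    first_token_ms) fix="pick a faster chat model" ;;
    embed_ms)       fix="check the embedding provider region" ;;
    ann_ms)         fix="run Qdrant locally (VECTOR_PROVIDER=qdrant)" ;;
    rerank_ms)      fix="lower RAG_RERANK_FAN_OUT" ;;
  esac
  echo "$stage is the largest stage at ${worst}ms: $fix"
}
```

For example, `verdict 120 450 150 250` points at `ann_ms`, while `verdict 100 90 150 250` reports that every stage is green.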

CLI: perf:hotpath

Same measurement, scriptable. Run after any change you hope made things faster (model switch, Qdrant migration, cache driver) and compare runs:

php artisan perf:hotpath                # 10 turns, first published agent
php artisan perf:hotpath --turns=25     # bigger sample
php artisan perf:hotpath --agent=<id>   # specific agent

The command sends real widget turns (same JWT + SSE path the embedded widget uses), prints wall-clock TTFB per turn, the same per-stage p50/p95 table as the dashboard, and the same verdict line. A unique suffix per message defeats the 30-minute retrieve cache so every turn exercises the full pipeline. When no provider keys are configured it warns that FakeOpenAi is bound; those runs measure pipeline overhead only.
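One way to compare runs before and after a change is to capture each run's output under a label and diff the two files. The wrapper below is a sketch: `save_run`, `compare_runs`, the `PERF_CMD` variable, and the `hotpath-<label>.txt` file names are all hypothetical, and only the documented `--turns` flag is used.

```shell
# Command to measure with; override PERF_CMD for a dry run.
PERF_CMD=${PERF_CMD:-"php artisan perf:hotpath --turns=25"}

# save_run LABEL
# Run the measurement once and store its output in hotpath-LABEL.txt.
save_run() {
  label=$1
  # Unquoted on purpose so the command string splits into words.
  $PERF_CMD > "hotpath-$label.txt"
}

# compare_runs LABEL_A LABEL_B
# Show a plain-text diff of two saved runs.
compare_runs() {
  # diff exits 1 when the files differ, which is the expected case here.
  diff "hotpath-$1.txt" "hotpath-$2.txt" || true
}
```

A typical session would be `save_run before`, apply the change (for example a model switch), `save_run after`, then `compare_runs before after`.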

Raw logs

Every turn also writes a structured rag.turn line to the Laravel log with the same stage breakdown plus retrieve_timings. For historical analysis beyond the 100-turn buffer:

grep 'rag.turn' storage/logs/laravel.log | tail -50
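For percentiles over a longer history, the log lines can be reduced with standard tools. The sketch below assumes each rag.turn line carries its stage timings as JSON-style integer fields such as "first_token_ms":412; that layout is an assumption about the log format, and `stage_percentile` is a hypothetical helper. It uses the nearest-rank percentile definition.

```shell
# stage_percentile STAGE PCT < logfile
# Print the nearest-rank PCT-th percentile (integer PCT, 1-100) of STAGE
# across all rag.turn lines read from standard input.
# Assumption: timings appear as "stage_name":<integer> in each line.
stage_percentile() {
  stage=$1
  pct=$2
  grep 'rag.turn' \
    | sed -n "s/.*\"$stage\": *\([0-9][0-9]*\).*/\1/p" \
    | sort -n \
    | awk -v p="$pct" '
        { v[NR] = $1 }
        END {
          if (NR == 0) exit 1
          # Nearest rank: ceil(p / 100 * N), computed with integer arithmetic.
          rank = int((p * NR + 99) / 100)
          if (rank < 1) rank = 1
          print v[rank]
        }'
}
```

Usage would look like `stage_percentile first_token_ms 95 < storage/logs/laravel.log`. If the timings are logged as floats or in a nested structure, the `sed` pattern needs adjusting.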