Blengi docs

Observability

Production telemetry runs on three legs: Sentry for errors, OpenTelemetry traces for the hot path, Horizon for queue health. The admin Site Health pill (see Site health & failed jobs) summarizes them at a glance.

Sentry

Set SENTRY_DSN and unhandled exceptions across the app flow into Sentry with stack traces, request context, and user/workspace metadata. The breadcrumb trail captures the last 100 log lines for every error. Integrated via sentry-laravel.

Useful filters in Sentry:

  • Tag workspace_id to scope errors to a tenant.
  • Tag agent_id when the error originates in a widget request.
  • Release tags match the deploy commit SHA, so it is easy to bisect when a regression appears.

OpenTelemetry traces

The OTEL exporter ships traces to OTEL_EXPORTER_OTLP_ENDPOINT, typically Honeycomb or Grafana Cloud Tempo. Spans wrap the hot path:

  • widget.message.receive: incoming HTTP, validation, JWT verify.
  • rag.curated.match: short-circuit check.
  • rag.embed: query embedding call.
  • rag.vector.search: ANN search.
  • rag.rerank: cross-encoder.
  • rag.prompt.assemble: local CPU work.
  • rag.llm.first_token: time-to-first-token (the headline metric).
  • rag.llm.stream: full stream duration.
  • rag.persist.async: post-stream save.

Each span is tagged with workspace_id, agent_id, conversation_id, provider (cloudflare / openai), and any cache-hit flags. The big one is the p95 of rag.llm.first_token: that's your hot-path SLO.
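For readers who want to see what a p95 means in practice, here is a minimal sketch of the nearest-rank calculation over a batch of span durations. The sample durations and the function name are made up for illustration. Your tracing backend will normally compute this for you, so treat it as an explanation of the number rather than something to deploy.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical rag.llm.first_token durations, in milliseconds.
first_token_ms = [310, 295, 420, 388, 1250, 305, 340, 298, 362, 915]

p95 = percentile(first_token_ms, 95)
p50 = percentile(first_token_ms, 50)
```

With only ten samples the p95 lands on the slowest request (1250 ms here) while the median stays at 340 ms, which is why a tail percentile, not an average, is the usual choice for a latency SLO.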

Horizon

/horizon is the queue dashboard. Required for production: without it, you're blind to backlogs. Watch:

  • Wait time: how long jobs sit before being picked up. Healthy is < 1s; investigate > 10s.
  • Throughput: jobs/min by queue.
  • Failed jobs: anything that lands in failed_jobs shows here too.

Queues to monitor:

Queue     What's on it
default   Misc: usage events, gap detection, audit logs, webhook deliveries.
crawl     CrawlSourceJob, CrawlPageJob, IngestNotionPageJob, IngestGoogleDocJob. Tends to have the deepest backlog.
index     IndexDocumentJob, IndexTextSourceJob. Embedding-heavy.

Logs

Standard Laravel logging. Default channels:

  • stdout: captured by Laravel Cloud / Docker.
  • sentry: error level and above.
  • slack: critical level, posts to ops channel.

Tail logs locally with php artisan pail.

Health endpoint

GET /up is the readiness probe: it returns 200 with a small JSON body if the app boots. Use it for load balancer health checks. For deeper checks, App\Support\PlatformAdminHeader runs the multi-step health check and exposes the result via the Inertia shared prop on every admin page.
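If you want to poll the readiness probe from a script rather than a load balancer, a minimal sketch might look like the following. It uses only the Python standard library. The function name, the timeout default, and the injectable opener parameter are illustrative choices, not part of the app. It checks the status code only and does not inspect the JSON body.

```python
import urllib.request


def is_ready(base_url, timeout=2.0, opener=urllib.request.urlopen):
    """Return True when GET {base_url}/up answers with HTTP 200."""
    url = base_url.rstrip("/") + "/up"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection errors, timeouts, and urllib's HTTPError
        # (raised for 4xx/5xx), all of which are OSError subclasses.
        return False


# Usage (hypothetical host):
#   is_ready("https://app.example.com")
```

The opener argument exists only so the function can be exercised without a network; in normal use the default is fine.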

Metrics to watch

The handful of metrics that matter most:

  • p95 first-token latency: < 1s.
  • p95 full-response latency: < 5s for short answers.
  • Crawl queue depth: should drain within minutes.
  • Index queue depth: should drain within minutes.
  • Failed-jobs count: 0 in steady state. Anything > 50 is alarm-worthy.
  • LLM provider error rate: < 1% of streams.
  • Vector store query latency: p95 < 100ms.
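The numeric budgets above can be written down as data and checked against a metrics snapshot. The sketch below assumes you already export these values somewhere. The metric keys, the snapshot values, and the "investigate" label for the 1 to 50 failed-jobs band are all hypothetical; only the thresholds come from the list.

```python
# Budgets from the list above. The metric keys are hypothetical names.
BUDGETS = {
    "first_token_p95_s": 1.0,      # p95 first-token latency, seconds
    "full_response_p95_s": 5.0,    # p95 full-response latency, short answers
    "llm_error_rate": 0.01,        # fraction of streams that error
    "vector_query_p95_ms": 100.0,  # p95 vector store query latency
}


def over_budget(snapshot):
    """Return the sorted names of metrics at or above their budget."""
    return sorted(
        name
        for name, limit in BUDGETS.items()
        if name in snapshot and snapshot[name] >= limit
    )


def failed_jobs_status(count):
    """0 is steady state and > 50 is alarm-worthy; the middle label is assumed."""
    if count == 0:
        return "steady"
    return "alarm" if count > 50 else "investigate"


# Hypothetical snapshot of current values.
snapshot = {
    "first_token_p95_s": 0.82,
    "full_response_p95_s": 5.4,
    "llm_error_rate": 0.004,
    "vector_query_p95_ms": 131.0,
}
breaches = over_budget(snapshot)
```

Here breaches contains full_response_p95_s and vector_query_p95_ms. Queue depths are left out because the guidance for them ("should drain within minutes") is a trend, not a single threshold.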

Alerts

Recommended PagerDuty / Slack alerts:

  • Sentry: new release-blocking error.
  • Honeycomb: first-token p95 > 1.5s for 5 minutes.
  • Horizon: failed-jobs delta > 10 in 5 minutes.
  • Stripe webhook: > 5 consecutive verification failures (signing key mismatch).
  • Reverb: process down.
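Two of these rules are time-windowed: "above a threshold for 5 minutes" and "counter increased by more than N in 5 minutes". In practice you would express them in your alerting tool's own rule syntax. The sketch below only illustrates the logic, under the assumption that you have time-ordered (timestamp_seconds, value) samples; the function names and sample data are hypothetical.

```python
def sustained_breach(samples, threshold, window_s):
    """True when the most recent unbroken run of samples above
    `threshold` spans at least `window_s` seconds."""
    if not samples:
        return False
    latest_ts = samples[-1][0]
    breach_start = None
    for ts, value in reversed(samples):
        if value <= threshold:
            break  # the run of breaching samples ends here
        breach_start = ts
    return breach_start is not None and latest_ts - breach_start >= window_s


def delta_in_window(samples, window_s):
    """Increase of a cumulative counter over the trailing window."""
    if not samples:
        return 0
    latest_ts, latest = samples[-1]
    baseline = latest
    for ts, value in reversed(samples):
        if latest_ts - ts > window_s:
            break  # older than the window
        baseline = value
    return latest - baseline


# Hypothetical first-token p95 readings (seconds), one per minute.
p95_samples = [(0, 1.2), (60, 1.6), (120, 1.7), (180, 1.8),
               (240, 1.6), (300, 1.9), (360, 1.7)]
latency_alert = sustained_breach(p95_samples, threshold=1.5, window_s=300)

# Hypothetical cumulative failed-jobs counts.
failed_samples = [(0, 3), (60, 3), (120, 5), (300, 9), (360, 16)]
failed_alert = delta_in_window(failed_samples, window_s=300) > 10
```

Both alerts fire for this data: the p95 has stayed above 1.5s from t=60 to t=360, and the failed-jobs counter rose by 13 inside the last five minutes. Requiring a sustained breach rather than a single bad sample keeps one slow request from paging anyone.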

Site Health pill

The header pill in the admin panel is a quick visual check that everything is configured. Green is the steady state; if it goes amber, the dropdown tells you exactly which check failed and links to the settings page to fix it. See Site health & failed jobs.