Blengi docs


Picking a fast model

The chat model your agent uses is the single biggest knob for reply speed. A 200 ms time-to-first-token (TTFT) feels instant; a 1.5 s TTFT feels broken. Most "the bot is slow" complaints from buyers trace back to picking a heavyweight model when a fast one would have done.

Where to pick a model

Super-admin → Settings → System → AI providers. Each provider (Cloudflare, OpenAI, OpenRouter) ships a dropdown of curated models with a speed badge, a cost tier, and an estimated TTFT. The estimate comes from each vendor's published latency dashboard as of the catalogue's last update; the real number on your install can vary by ±30% depending on your region.
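To make the ±30% figure concrete, here is a small sketch of the range a catalogue estimate implies. The function name and the 200 ms input are illustrative only, not part of the product.

```python
def ttft_band(estimate_ms: float, tolerance: float = 0.30) -> tuple[float, float]:
    """Return the (low, high) TTFT range implied by a catalogue estimate."""
    return (estimate_ms * (1 - tolerance), estimate_ms * (1 + tolerance))


# Hypothetical catalogue estimate of 200 ms.
low, high = ttft_band(200)
print(f"expect roughly {low:.0f} to {high:.0f} ms")  # expect roughly 140 to 260 ms
```

A model listed at 200 ms can therefore land anywhere between "instant" and noticeably slower on a given install, which is why measuring on your own server is worth the click.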

The "Test connection" button

Next to each model dropdown is a Test connection button. Click it and the server runs a one-token chat against the configured provider and reports the actual TTFT your install sees. The measurement is cached for 24 hours so reloading the page does not burn API budget. Click again to refresh.

If the probe fails — bad key, model deprecated, provider down — the button shows the error inline so you can fix it without leaving the page.
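The probe behaviour described above can be sketched roughly as follows. This is not the app's actual implementation: `run_probe`, `measure_ttft_ms`, the in-memory cache, and the `stream_one_token` callable are all hypothetical names used for illustration. The sketch assumes a page load reads the cached value (`refresh=False`) and a button click forces a new measurement (`refresh=True`).

```python
import time
from typing import Callable, Iterable

CACHE_TTL_SECONDS = 24 * 60 * 60  # the 24-hour cache window

# (provider, model) -> (measured_at, ttft_ms). A stand-in for the real store.
_cache: dict[tuple[str, str], tuple[float, float]] = {}


def measure_ttft_ms(stream_one_token: Callable[[], Iterable[str]]) -> float:
    """Time from issuing the request until the first streamed token arrives."""
    start = time.perf_counter()
    for _token in stream_one_token():
        return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("provider returned no tokens")


def run_probe(provider: str, model: str,
              stream_one_token: Callable[[], Iterable[str]],
              *, refresh: bool = False) -> dict:
    """Return a cached TTFT if it is fresh, otherwise measure and cache it."""
    key = (provider, model)
    cached = _cache.get(key)
    if cached and not refresh and time.time() - cached[0] < CACHE_TTL_SECONDS:
        return {"ok": True, "ttft_ms": cached[1], "cached": True}
    try:
        ttft = measure_ttft_ms(stream_one_token)
    except Exception as exc:  # bad key, deprecated model, provider down, ...
        return {"ok": False, "error": str(exc)}  # shown inline next to the button
    _cache[key] = (time.time(), ttft)
    return {"ok": True, "ttft_ms": ttft, "cached": False}
```

In this sketch a failed probe is never cached, so fixing the key and clicking again always triggers a fresh measurement.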

Default picks (June 2026)

| Provider | Pick | Why |
| --- | --- | --- |
| Cloudflare Workers AI | @cf/meta/llama-3.3-70b-instruct-fp8-fast | Fastest 70B on Cloudflare. Free for most installs. Tool-calling reliable. |
| OpenAI | gpt-4o-mini | 10× cheaper than GPT-4o, ~3× faster TTFT. Quality good enough for sales-bot use. |
| OpenRouter | meta-llama/llama-3.3-70b-instruct:free | Free tier; rate-limited but enough for low-volume sites. |

When to pick something slower

  • Long-form / nuanced replies (legal, finance, support escalations). Switch to GPT-4o or Claude 3.5 Sonnet. Buyers will accept slower TTFT in exchange for fewer wrong answers.
  • Huge knowledge bases. Gemini 1.5 Flash has a 1M-token context window, large enough to fit most knowledge bases in a single prompt.
  • EU residency requirement. Mistral Small via OpenRouter routes to European infrastructure.

Automatic model fallback (self-heal)

A single-Cloudflare install no longer dies when its chat model is slow, cold-starting, or returns a 5xx. The app keeps a second Cloudflare chat model on standby and switches to it automatically for that turn — no second provider key required. By default the primary is the 70B model and the fallback is the fast 8B model (@cf/meta/llama-3.1-8b-instruct), so a slow heavyweight degrades to a quick lighter answer instead of an error.

  • Zero overhead when healthy. The fallback is tried only after the primary actually fails, so a working primary pays nothing. On the visitor hot path the switch can happen only before the first token, so the streaming latency budget is untouched.
  • Cloudflare first, then other providers. If you also have an OpenAI or OpenRouter key, the order is Cloudflare primary → Cloudflare fallback → OpenAI / OpenRouter. We exhaust Cloudflare's own models before hopping providers.
  • Override or disable it. Set CLOUDFLARE_CHAT_MODEL_FALLBACK to any other Workers AI chat slug, or to an empty value to turn the in-provider fallback off. A fallback equal to the primary is ignored (a same-model retry buys nothing).
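The rules in the list above can be sketched as a small failover chain. This is an illustration under stated assumptions, not the app's code: `build_chain`, `stream_with_failover`, `open_stream`, and `on_failover` are hypothetical names, and the sketch assumes an unset CLOUDFLARE_CHAT_MODEL_FALLBACK means "use the default 8B model" while an empty value means "off".

```python
import os
from typing import Callable, Iterable, Iterator, Mapping, Optional

DEFAULT_FALLBACK = "@cf/meta/llama-3.1-8b-instruct"
Candidate = tuple[str, str]  # (provider, model slug)


def cloudflare_fallback(primary: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Resolve the in-provider fallback slug, or None when it is disabled."""
    value = env.get("CLOUDFLARE_CHAT_MODEL_FALLBACK", DEFAULT_FALLBACK).strip()
    if not value or value == primary:
        return None  # empty value disables it; a same-model retry buys nothing
    return value


def build_chain(primary: str, env: Mapping[str, str] = os.environ,
                other_providers: Iterable[Candidate] = ()) -> list[Candidate]:
    """Cloudflare primary, then Cloudflare fallback, then other providers."""
    chain = [("cloudflare", primary)]
    fallback = cloudflare_fallback(primary, env)
    if fallback:
        chain.append(("cloudflare", fallback))
    chain.extend(other_providers)
    return chain


def stream_with_failover(chain: list[Candidate],
                         open_stream: Callable[[Candidate], Iterable[str]],
                         on_failover: Callable[[Candidate, Exception], None] = lambda c, e: None,
                         ) -> Iterator[str]:
    """Yield tokens from the first candidate that produces a first token."""
    last_error: Optional[Exception] = None
    for candidate in chain:
        try:
            tokens = iter(open_stream(candidate))
            first = next(tokens)  # any failure here happens before the first token
        except Exception as exc:
            last_error = exc
            on_failover(candidate, exc)  # e.g. record a provider failover event
            continue
        yield first
        yield from tokens  # no switching once streaming has started
        return
    raise RuntimeError("all chat models failed") from last_error
```

Because the switch is decided before the first token is yielded, a healthy primary goes through the loop exactly once and never touches the fallback.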

Each switch is recorded as a provider failover event in the Widget Monitor, so a primary model that fails over constantly is visible — that's your signal to make the fallback the primary, or to size up the account.

Why a model is or is not in the catalogue

The catalogue (App\Services\Llm\ModelCatalog) is hand-curated. We surface models that:

  • Stream tokens via the OpenAI-style chat/completions endpoint.
  • Reliably honour the JSON tool-calling format (where flagged).
  • Are not deprecated by their vendor.

If your model is not in the dropdown, the picker still lets you paste a custom ID: the "Custom" entry is pinned at the top of the list and switches the field back to the original free-text input. The Test connection button works against custom models too.
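As a rough illustration of the curation rules and the "Custom" escape hatch, the sketch below models a catalogue entry and the dropdown built from it. It is written in Python for readability and does not mirror the real App\Services\Llm\ModelCatalog class; every field and function name here is hypothetical.

```python
from dataclasses import dataclass

CUSTOM = "Custom"


@dataclass(frozen=True)
class ModelEntry:
    model_id: str
    streams_chat_completions: bool  # streams via an OpenAI-style endpoint
    tool_calling_flagged: bool      # catalogue advertises tool calling
    tool_calling_reliable: bool     # honours the JSON tool-calling format
    deprecated: bool                # vendor has deprecated the model


def is_listed(entry: ModelEntry) -> bool:
    """Apply the three curation rules to one entry."""
    if not entry.streams_chat_completions or entry.deprecated:
        return False
    # Tool-calling reliability only matters where the entry is flagged for it.
    return entry.tool_calling_reliable or not entry.tool_calling_flagged


def dropdown_options(entries: list[ModelEntry]) -> list[str]:
    """'Custom' is pinned first, followed by the curated models."""
    return [CUSTOM] + [e.model_id for e in entries if is_listed(e)]


def resolve_model_id(selection: str, custom_id: str = "") -> str:
    """Return the model ID to use, reading the free-text input for 'Custom'."""
    if selection != CUSTOM:
        return selection
    custom_id = custom_id.strip()
    if not custom_id:
        raise ValueError("paste a model ID when using the Custom entry")
    return custom_id
```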

Caveats

  • Catalogue TTFT estimates drift. Vendors silently change inference hardware. Re-run Test connection monthly if you care about exact numbers.
  • Probe runs from your server's region. A buyer in Tokyo will see different latency than one in Frankfurt for the same model, and the probe captures neither: it measures from whichever server the install runs on.
  • Probe consumes one generated token per click, plus the short probe prompt. At OpenAI's gpt-4o-mini rate, that is well under $0.0001. Negligible.
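The cost claim in the last caveat can be checked with quick arithmetic. The prices below are assumptions (gpt-4o-mini list prices of $0.15 per million input tokens and $0.60 per million output tokens; confirm against the current price sheet), and the 10-token prompt size is a guess at what a probe might send.

```python
# Assumed gpt-4o-mini list prices in USD per one million tokens.
INPUT_PRICE_PER_MILLION = 0.15
OUTPUT_PRICE_PER_MILLION = 0.60


def probe_cost_usd(prompt_tokens: int = 10, output_tokens: int = 1) -> float:
    """Cost of one probe: a short prompt plus a single generated token."""
    total = prompt_tokens * INPUT_PRICE_PER_MILLION + output_tokens * OUTPUT_PRICE_PER_MILLION
    return total / 1_000_000


per_click = probe_cost_usd()
print(f"per click: ${per_click:.7f}")          # per click: $0.0000021
print(f"daily for a year: ${per_click * 365:.4f}")  # daily for a year: $0.0008
```

Even clicking daily for a year stays under a tenth of a cent under these assumptions, so the 24-hour cache is about avoiding accidental bursts rather than protecting a meaningful budget.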