What we show
Every request through the whyllm proxy becomes one immutable span — 49 indexed fields across 9 categories, with the raw request and response bodies stored verbatim. Here is every single one.
Anatomy of a captured call
This is the real span schema — not a marketing summary. Tap any tile to inspect exactly what whyllm records.
Open by default
whyllm's span schema is aligned with the OpenTelemetry GenAI semantic conventions — the fields map straight across, so your data stays portable. Export it, pipe it into any OTel backend, leave whenever you want.
Captures calls to every major provider
Feature tour
Capture is the start. These four pillars are what whyllm does with it.
You can’t debug a call you never saw.
Every LLM call becomes a searchable span, and multi-step agent calls are stitched into one trace. Filter by model, environment, user, or tag and land on the exact request in two clicks.
The invoice should never be the first you hear of it.
Not just charts — enforcement. Set budgets per project, key, or user; the proxy returns HTTP 429 before an over-budget call ever reaches the provider. Auto-route to a cheaper model when a threshold trips.
Right now you find out from a screenshot on Twitter.
A heuristic scorer runs on every response in under a millisecond — hedge ratio, repetition, overconfidence, length anomalies. An LLM judge samples only the flagged spans, so cost stays near zero.
By the time it is an incident, it is already too late.
Because whyllm sits inline as a proxy, every prediction can become an in-request action. Cost forecasts, rate-limit ETAs, and model-drift alerts surface on a schedule — before they turn into outages.
Premium · AI copilots
Sampled, summarised observability data can't power real AI features — there's nothing for a model to read. whyllm keeps every call verbatim, so these premium copilots reason over the whole record. One is in beta today; the rest are on the roadmap.
Search your whole history in plain English. “Which prompts regressed after the gpt-4o snapshot bump?” returns an answer grounded in real spans — every one cited and clickable.
When drift or a cost spike fires, an agent walks the surrounding spans and writes the post-mortem: what changed, which calls were hit, and the likely fix.
A model judge re-scores responses for faithfulness, correctness and tone — running only on heuristically-flagged spans, so the bill stays near zero.
Refusals, errors and hallucinations grouped by meaning, not string match — 200 broken rows collapse into the five bugs actually behind them.
Your production traffic becomes your test suite. whyllm mines real spans into a golden regression set — no manual labelling, no synthetic data.
Analyses every span for an endpoint and proposes a shorter, cheaper prompt that holds quality — then A/B-tests it inline through the proxy.
Available on Pro and Enterprise — see the plan breakdown below.
Plans
Same proxy, same full capture on every plan. Higher tiers unlock enforcement, AI copilots, and scale.
Repoint one base URL and the next call you make is captured whole — all 49 fields.
Start monitoring free →