What we show

We don’t sample. We don’t summarize. We keep the whole call.

Every request through the whyllm proxy becomes one immutable span — 49 indexed fields across 9 categories, with the raw request and response bodies stored verbatim. Here is every single one.

9 capture categories
49 indexed fields
0% sampled (we keep all of it)
100% captured from request #1

Anatomy of a captured call

Pick a category. See every field.

This is the real span schema — not a marketing summary. Tap any tile to inspect exactly what whyllm records.

span · request · 9 fields

  • provider ("openai"): Upstream the call was routed to
  • model ("gpt-4o"): Requested model, canonicalized across dated snapshots
  • request.messages ([ {role, content}, … ]): The full message array, stored verbatim
  • request.system ("You are…" + sha256): System prompt kept separate, hashed for version tracking
  • request.temperature (0.7): Sampling parameters exactly as sent
  • request.max_tokens (1024): Generation ceiling requested
  • request.tools ([ {name, schema}, … ]): Tool / function definitions offered to the model
  • request.stream (true): Whether the response was streamed
  • headers (user-agent · sdk_version · …): Request headers, including custom X-whyllm-Tag-*
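As a rough illustration of the request category, the sketch below assembles the nine fields into one record. It assumes a plain Python dict; `build_request_span` and the nested `text` / `sha256` keys are hypothetical names chosen for this example, not whyllm's actual API or storage layout. The only transformation it performs is the one the field list describes: the system prompt is kept separate and hashed with SHA-256 so prompt versions can be compared.

```python
import hashlib


def build_request_span(provider, model, body, headers):
    """Collect the nine request-category fields into one dict (illustrative only)."""
    system = body.get("system", "")
    return {
        "provider": provider,
        "model": model,
        # Message array copied through unchanged.
        "request.messages": body.get("messages", []),
        # System prompt stored on its own, with a hash for version tracking.
        "request.system": {
            "text": system,
            "sha256": hashlib.sha256(system.encode("utf-8")).hexdigest(),
        },
        "request.temperature": body.get("temperature"),
        "request.max_tokens": body.get("max_tokens"),
        "request.tools": body.get("tools", []),
        "request.stream": body.get("stream", False),
        "headers": dict(headers),
    }
```

Two requests with the same system prompt produce the same hash, so a change in the hash marks a new prompt version without diffing the full text.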

Open by default

Speaks the standard. No lock-in.

whyllm's span schema is aligned with the OpenTelemetry GenAI semantic conventions — the fields map straight across, so your data stays portable. Export it, pipe it into any OTel backend, leave whenever you want.

model → gen_ai.request.model
input_tokens → gen_ai.usage.input_tokens
finish_reason → gen_ai.response.finish_reasons
provider → gen_ai.provider.name
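The four field pairs listed above can be applied as a simple key rename. The sketch below is one way to do that, assuming a span is available as a flat dict; `to_otel_attributes` is a hypothetical helper, not part of whyllm or of any OpenTelemetry SDK. Because `gen_ai.response.finish_reasons` is plural in the GenAI conventions, a single reason is wrapped in a list.

```python
# Mapping copied from the four pairs listed above.
WHYLLM_TO_OTEL = {
    "model": "gen_ai.request.model",
    "input_tokens": "gen_ai.usage.input_tokens",
    "finish_reason": "gen_ai.response.finish_reasons",
    "provider": "gen_ai.provider.name",
}


def to_otel_attributes(span):
    """Rename the mapped keys; keys without a mapping are skipped."""
    attrs = {}
    for key, value in span.items():
        otel_key = WHYLLM_TO_OTEL.get(key)
        if otel_key is None:
            continue
        # The OTel attribute holds an array of reasons, so wrap a lone string.
        if otel_key == "gen_ai.response.finish_reasons" and isinstance(value, str):
            value = [value]
        attrs[otel_key] = value
    return attrs
```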

Captures calls to every major provider

OpenAI · Azure OpenAI · Anthropic · AWS Bedrock · Google Vertex AI · Mistral · any OpenAI-compatible endpoint

Feature tour

What that data turns into

Capture is the start. These four pillars are what whyllm does with it.

Tracing

Full-stack tracing

You can’t debug a call you never saw.

Every LLM call becomes a searchable span, and multi-step agent calls are stitched into one trace. Filter by model, environment, user, or tag and land on the exact request in two clicks.

  • Raw request and response bodies, kept verbatim
  • Agent chains reconstructed via trace_id and parent_span_id
  • Full-text search across your entire history

trace · 3 spans
gpt-4o · $0.0042 · 1.8s
text-embedding-3-large · $0.0001 · 0.1s
claude-sonnet-4-6 · $0.0091 · 2.2s

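Stitching agent calls into one trace via trace_id and parent_span_id amounts to grouping spans by trace and linking each one to its parent. A minimal sketch, assuming each span is a dict that also carries a `span_id` key (an assumption; the page only names `trace_id` and `parent_span_id`), with `build_trace_tree` as a hypothetical helper:

```python
from collections import defaultdict


def build_trace_tree(spans, trace_id):
    """Return (root span ids, parent -> child ids) for one trace."""
    children = defaultdict(list)
    roots = []
    for span in spans:
        if span["trace_id"] != trace_id:
            continue  # belongs to a different trace
        parent = span.get("parent_span_id")
        if parent is None:
            roots.append(span["span_id"])  # top-level call of the agent run
        else:
            children[parent].append(span["span_id"])
    return roots, dict(children)
```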
Cost

Cost control & budgets

The invoice should never be the first you hear of it.

Not just charts — enforcement. Set budgets per project, key, or user; the proxy returns HTTP 429 before an over-budget call ever reaches the provider. Auto-route to a cheaper model when a threshold trips.

  • Hard budget caps enforced inline, at the proxy
  • Spend attributed by feature, user, model, or tag
  • Auto-routing to a cheaper model on threshold
monthly budget
$127.40 / $300.00
HTTP 429 · budget cap enforced
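The inline enforcement described above can be pictured as a check that runs before the call is forwarded. The sketch below is illustrative only: `check_budget`, `choose_model`, the 0.8 threshold, and the model names in the usage are assumptions for this example, not whyllm's configuration or defaults.

```python
def check_budget(spent_usd, estimated_cost_usd, cap_usd):
    """Return 429 if the call would push spend past the cap, else 200."""
    if spent_usd + estimated_cost_usd > cap_usd:
        return 429  # rejected at the proxy; the provider is never called
    return 200


def choose_model(spent_usd, cap_usd, requested, cheaper, threshold=0.8):
    """Route to the cheaper model once spend reaches threshold * cap."""
    if spent_usd >= threshold * cap_usd:
        return cheaper
    return requested
```

For example, with a $300.00 cap and $127.40 spent, `check_budget(127.40, 0.01, 300.0)` returns 200 and `choose_model(127.40, 300.0, "requested-model", "cheaper-model")` keeps the requested model.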
Quality

Hallucination detection

Right now you find out from a screenshot on Twitter.

A heuristic scorer runs on every response in under a millisecond — hedge ratio, repetition, overconfidence, length anomalies. An LLM judge samples only the flagged spans, so cost stays near zero.

  • A 0–1 confidence score on 100% of responses
  • Heuristic flags pinpoint why a response looks off
  • LLM-as-judge confirmation on flagged spans only
hallucination score
0.18
token_repetition · hedge_mismatch
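To make the heuristic idea concrete, here is a small sketch of two of the signals named above, hedge ratio and repetition. The hedge word list, the thresholds, and the `high_hedge_ratio` flag name are assumptions made for illustration; whyllm's actual scorer and its flag definitions are not documented on this page.

```python
import re

# Illustrative hedge vocabulary; a real scorer would likely use a larger list.
HEDGES = {"maybe", "possibly", "perhaps", "might", "probably"}


def _words(text):
    return re.findall(r"[a-z']+", text.lower())


def hedge_ratio(text):
    """Fraction of words that are hedge words."""
    words = _words(text)
    if not words:
        return 0.0
    return sum(w in HEDGES for w in words) / len(words)


def repetition_ratio(text):
    """1 minus the share of distinct words; 0.0 means no word repeats."""
    words = _words(text)
    if not words:
        return 0.0
    return 1.0 - len(set(words)) / len(words)


def heuristic_flags(text, hedge_limit=0.2, repetition_limit=0.5):
    """Return the names of the heuristics that tripped."""
    flags = []
    if hedge_ratio(text) > hedge_limit:
        flags.append("high_hedge_ratio")
    if repetition_ratio(text) > repetition_limit:
        flags.append("token_repetition")
    return flags
```

Checks like these are pure string operations, which is why they can run on every response, leaving the more expensive judge for the spans that come back with flags.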
Prediction

Prediction & control plane

By the time it is an incident, it is already too late.

Because whyllm sits inline as a proxy, every prediction can become an in-request action. Cost forecasts, rate-limit ETAs, and model-drift alerts surface on a schedule — before they turn into outages.

  • Month-end spend projected from week-over-week burn
  • Rate-limit ETA — know the ceiling before you hit it
  • Model drift caught against a rolling 7-day baseline
insight · cost forecast
Budget overrun likely
$420 month-to-date → $612 projected by May 31
severity: warning · confidence: 0.86
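One simple way to project month-end spend from week-over-week burn is to extend the latest week's daily rate over the remaining days, scaled by the week-over-week growth factor. The sketch below assumes that model; `project_month_end` and its inputs are hypothetical, and it is not claimed to reproduce the figures in the example insight above.

```python
def project_month_end(month_to_date, last_week, previous_week, days_remaining):
    """Project month-end spend from the last two weeks of burn (illustrative model)."""
    daily_rate = last_week / 7.0
    # Week-over-week growth; fall back to flat growth if there is no prior spend.
    growth = last_week / previous_week if previous_week > 0 else 1.0
    return month_to_date + daily_rate * growth * days_remaining
```

With $420 spent so far, $140 in each of the last two weeks, and 10 days left, the projection is 420 + 20 × 1.0 × 10 = $620.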

Premium · AI copilots

AI that works because nothing was thrown away.

Sampled, summarized observability data can't power real AI features: there's nothing for a model to read. whyllm keeps every call verbatim, so these premium copilots reason over the whole record. One is in beta today; the rest are on the roadmap.

Roadmap · Pro

Ask your traces

Search your whole history in plain English. “Which prompts regressed after the gpt-4o snapshot bump?” returns an answer grounded in real spans — every one cited and clickable.

Runs on
request.messages · response · tags
Roadmap · Pro

Root-cause copilot

When drift or a cost spike fires, an agent walks the surrounding spans and writes the post-mortem: what changed, which calls were hit, and the likely fix.

Runs on
trace_id · model_drift · system_fingerprint
Beta · Pro

LLM-judge evaluations

A model judge re-scores responses for faithfulness, correctness and tone — running only on heuristically-flagged spans, so the bill stays near zero.

Runs on
hallucination_flags · response
Roadmap · Pro

Semantic failure clustering

Refusals, errors and hallucinations grouped by meaning, not string match — 200 broken rows collapse into the five bugs actually behind them.

Runs on
refusal · error_type · response
Roadmap · Enterprise

Auto-built eval suites

Your production traffic becomes your test suite. whyllm mines real spans into a golden regression set — no manual labelling, no synthetic data.

Runs on
request · response · llm_judge
Roadmap · Enterprise

Prompt optimizer

Analyses every span for an endpoint and proposes a shorter, cheaper prompt that holds quality — then A/B-tests it inline through the proxy.

Runs on
request.system · cost_usd · hallucination_score

Available on Pro and Enterprise — see the plan breakdown below.

Plans

What you get on each plan

Same proxy, same full capture on every plan. Higher tiers unlock enforcement, AI copilots, and scale.

Plan: Basic · Pro · Enterprise
Price: $0 forever · $10 / month · Custom

Capture & history
Calls captured: 100% · 100% · 100%
Spans / month: 50,000 · Unlimited · Unlimited
History retention: 7 days · 90 days · Custom
Projects: 1 · Unlimited · Unlimited
Raw request / response bodies

Intelligence
Cost & token dashboard
Hallucination scoring
Prediction insights
Budget enforcement (HTTP 429)
Alerts & webhooks

AI copilots
LLM-judge evaluations
Ask your traces (natural-language search)
Root-cause copilot
Semantic failure clustering
Auto-built eval suites
Prompt optimizer

Scale & security
SSO / SAML
Self-hosted option
SLA guarantee

Support
Support: Community · Email · Dedicated

Stop guessing what your LLMs did.

Repoint one base URL and the next call you make is captured whole — all 49 fields.
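Repointing a base URL usually means the client keeps its request shape and only the host changes. The sketch below builds (but does not send) an OpenAI-style chat request using only the standard library. The proxy URL in the usage note is a placeholder, `build_chat_request` is a hypothetical helper, and the `/chat/completions` path assumes an OpenAI-compatible endpoint; check the whyllm setup docs for the real values. The X-whyllm-Tag-* header prefix is the one named in the request fields.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, payload, tags=None):
    """Build a POST request against an OpenAI-compatible base URL."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    # Custom tags travel as X-whyllm-Tag-* request headers.
    for name, value in (tags or {}).items():
        headers[f"X-whyllm-Tag-{name}"] = value
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Calling `build_chat_request("https://proxy.example.invalid/v1", key, payload, tags={"feature": "search"})` instead of passing the provider's own base URL is the only change on the client side in this sketch; the payload is untouched.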

Start monitoring free →