What we show

We don’t sample.We don’t summarize.We keep the whole call.

Every request through the whyllm proxy becomes one immutable span — 49 indexed fields across 9 categories, with the raw request and response bodies stored verbatim. Here is every single one.

capture categories

indexed fields

sampled — we keep all of it

100%

captured from request #1

Anatomy of a captured call

Pick a category. See every field.

This is the real span schema — not a marketing summary. Tap any tile to inspect exactly what whyllm records.

span · request9 fields

provider

"openai"

Upstream the call was routed to

model

"gpt-4o"

Requested model, canonicalized across dated snapshots

request.messages

[ {role, content}, … ]

The full message array, stored verbatim

request.system

"You are…" + sha256

System prompt kept separate, hashed for version tracking

request.temperature

0.7

Sampling parameters exactly as sent

request.max_tokens

1024

Generation ceiling requested

request.tools

[ {name, schema}, … ]

Tool / function definitions offered to the model

request.stream

true

Whether the response was streamed

headers

user-agent · sdk_version · …

Request headers, including custom X-whyllm-Tag-*

Open by default

Speaks the standard. No lock-in.

whyllm's span schema is aligned with the OpenTelemetry GenAI semantic conventions — the fields map straight across, so your data stays portable. Export it, pipe it into any OTel backend, leave whenever you want.

model→gen_ai.request.model

input_tokens→gen_ai.usage.input_tokens

finish_reason→gen_ai.response.finish_reasons

provider→gen_ai.provider.name

Captures calls to every major provider

OpenAIAzure OpenAIAnthropicAWS BedrockGoogle Vertex AIMistralany OpenAI-compatible endpoint

Feature tour

What that data turns into

Capture is the start. These four pillars are what whyllm does with it.

Tracing

Full-stack tracing

You can’t debug a call you never saw.

Every LLM call becomes a searchable span, and multi-step agent calls are stitched into one trace. Filter by model, environment, user, or tag and land on the exact request in two clicks.

›Raw request and response bodies, kept verbatim
›Agent chains reconstructed via trace_id and parent_span_id
›Full-text search across your entire history

trace · 3 spans

gpt-4o$0.00421.8s✓

text-embedding-3-large$0.00010.1s✓

claude-sonnet-4-6$0.00912.2s⚠

Cost

Cost control & budgets

The invoice should never be the first you hear of it.

Not just charts — enforcement. Set budgets per project, key, or user; the proxy returns HTTP 429 before an over-budget call ever reaches the provider. Auto-route to a cheaper model when a threshold trips.

›Hard budget caps enforced inline, at the proxy
›Spend attributed by feature, user, model, or tag
›Auto-routing to a cheaper model on threshold

monthly budget

$127.40 / $300.00

HTTP 429 · budget cap enforced

Quality

Hallucination detection

Right now you find out from a screenshot on Twitter.

A heuristic scorer runs on every response in under a millisecond — hedge ratio, repetition, overconfidence, length anomalies. An LLM judge samples only the flagged spans, so cost stays near zero.

›A 0–1 confidence score on 100% of responses
›Heuristic flags pinpoint why a response looks off
›LLM-as-judge confirmation on flagged spans only

hallucination score

0.18

token_repetitionhedge_mismatch

Prediction

Prediction & control plane

By the time it is an incident, it is already too late.

Because whyllm sits inline as a proxy, every prediction can become an in-request action. Cost forecasts, rate-limit ETAs, and model-drift alerts surface on a schedule — before they turn into outages.

›Month-end spend projected from week-over-week burn
›Rate-limit ETA — know the ceiling before you hit it
›Model drift caught against a rolling 7-day baseline

insight · cost forecast

Budget overrun likely

$420 month-to-date → $612 projected by May 31

severity: warning · confidence: 0.86

Premium · AI copilots

AI that works because nothing was thrown away.

Sampled, summarised observability data can't power real AI features — there's nothing for a model to read. whyllm keeps every call verbatim, so these premium copilots reason over the whole record. One is in beta today; the rest are on the roadmap.

RoadmapPro

Ask your traces

Search your whole history in plain English. “Which prompts regressed after the gpt-4o snapshot bump?” returns an answer grounded in real spans — every one cited and clickable.

Runs on

request.messages · response · tags

RoadmapPro

Root-cause copilot

When drift or a cost spike fires, an agent walks the surrounding spans and writes the post-mortem: what changed, which calls were hit, and the likely fix.

Runs on

trace_id · model_drift · system_fingerprint

BetaPro

LLM-judge evaluations

A model judge re-scores responses for faithfulness, correctness and tone — running only on heuristically-flagged spans, so the bill stays near zero.

Runs on

hallucination_flags · response

RoadmapPro

Semantic failure clustering

Refusals, errors and hallucinations grouped by meaning, not string match — 200 broken rows collapse into the five bugs actually behind them.

Runs on

refusal · error_type · response

RoadmapEnterprise

Auto-built eval suites

Your production traffic becomes your test suite. whyllm mines real spans into a golden regression set — no manual labelling, no synthetic data.

Runs on

request · response · llm_judge

RoadmapEnterprise

Prompt optimizer

Analyses every span for an endpoint and proposes a shorter, cheaper prompt that holds quality — then A/B-tests it inline through the proxy.

Runs on

request.system · cost_usd · hallucination_score

Available on Pro and Enterprise — see the plan breakdown below.

Plans

What you get on each plan

Same proxy, same full capture on every plan. Higher tiers unlock enforcement, AI copilots, and scale.

Basic

$0forever

Pro

$10/ month

Enterprise

Custom

Capture & history

Calls captured

100%

Spans / month

50,000

Unlimited

History retention

7 days

90 days

Custom

Projects

Unlimited

Raw request / response bodies

✓

Intelligence

Cost & token dashboard

✓

Hallucination scoring

—

✓

Prediction insights

—

✓

Budget enforcement (HTTP 429)

—

✓

Alerts & webhooks

—

✓

AI copilots

LLM-judge evaluations

—

✓

Ask your traces — natural-language search

—

✓

Root-cause copilot

—

✓

Semantic failure clustering

—

✓

Auto-built eval suites

—

✓

Prompt optimizer

—

✓

Scale & security

SSO / SAML

—

✓

Self-hosted option

—

✓

SLA guarantee

—

✓

Support

Community

Dedicated

Start free on Basic →See full pricing

Stop guessing what your LLMs did.

Repoint one base URL and the next call you make is captured whole — all 49 fields.

Start monitoring free →