Your LLMs are burning money. You just can't see it.

Real-time visibility into every LLM call, and an answer to every "why", without touching a single line of your existing code.

2 min
to full observability
0 changes
required in most cases
<50ms
added latency

Integration

Pick your path. Working in 2 minutes.

Four ways in — from two lines of code to zero file changes. Every call captured automatically from the first request.

1
Install the SDK
pip install whyllm
2
Add two lines
whyllm.init(api_key=...)
3
Ship it
Dashboard is live instantly
Python SDK

Add two lines to your existing code. Everything else stays the same.
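Concretely, the two-line integration from the steps above looks like this. Only the import and `init()` call are whyllm-specific; the OpenAI call is your existing code, unchanged. Any `init()` parameters beyond `api_key` are assumptions, not documented API.

```python
import whyllm
whyllm.init(api_key="YOUR_WHYLLM_KEY")  # lines 1 and 2: done

# Existing application code, captured automatically from here on.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-5.4",
    messages=[{"role": "user", "content": "Hello"}],
)
```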

Compatible with every major LLM provider

OpenAI · Anthropic · Azure OpenAI · AWS Bedrock · Google Vertex AI · Mistral · Cohere · Groq · Together AI · Perplexity · Ollama · Fireworks AI

The problem

Right now, you're flying blind

Every day without observability is money you can't recover and quality issues you can't explain.

Without whyllm

Invoice arrives. You had no idea the bill would be this high.

With whyllm
  • Proxy-layer budget cap — HTTP 429 before the call fires
  • GPT-5.4 surcharge alert: rate doubles $2.50→$5.00/M past 272K context
  • Auto-routes to gpt-5.4-mini when a project threshold trips
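A minimal sketch of how a proxy-layer budget cap can reject a call with HTTP 429 before it ever reaches the provider. Class and method names are illustrative, not the whyllm API.

```python
# Hypothetical budget gate: a call is checked against projected spend
# *before* the provider request fires; over-budget calls get a 429.
class BudgetCap:
    def __init__(self, monthly_budget_usd: float):
        self.budget = monthly_budget_usd
        self.spent = 0.0

    def check(self, estimated_cost_usd: float) -> int:
        """Return 200 if the call may proceed, 429 if it would bust the budget."""
        if self.spent + estimated_cost_usd > self.budget:
            return 429  # blocked; the provider is never called
        self.spent += estimated_cost_usd
        return 200

cap = BudgetCap(monthly_budget_usd=1.00)
print(cap.check(0.60))  # 200: within budget
print(cap.check(0.60))  # 429: would exceed $1.00
```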
Without whyllm

A user screenshots a hallucinated response. You find out on Twitter.

With whyllm
  • <1ms heuristic scorer: hedge ratio, factual anchoring, refusal patterns
  • Flags outputs below confidence threshold pre-response, not post
  • LLM judge fires only on flagged spans — cost stays near zero
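To make the "<1ms heuristic" concrete, here is a toy version of one of the signals named above, the hedge ratio: the share of hedging words among all tokens. A real scorer combines more signals; the word list and threshold here are made up for illustration.

```python
# Toy hedge-ratio scorer: flag low-confidence responses cheaply, so the
# expensive LLM judge only runs on flagged spans.
HEDGES = {"might", "maybe", "possibly", "perhaps", "likely", "unsure"}

def hedge_ratio(text: str) -> float:
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w.strip(".,") in HEDGES for w in words) / len(words)

def flag(text: str, threshold: float = 0.15) -> bool:
    """True if the response should be escalated to the LLM judge."""
    return hedge_ratio(text) > threshold

print(flag("Paris is the capital of France."))          # False
print(flag("It might possibly be Paris, maybe Lyon."))  # True
```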
Without whyllm

You tried three observability tools. Each took days to set up, and half your prompts still weren't captured.

With whyllm
  • `whyllm run app.py` — monkey-patches openai/anthropic at import time
  • Zero app code changes, zero proxy in the critical path
  • 100% capture rate from request #1
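The mechanism behind zero-code capture is ordinary monkey-patching: wrap a client method once, at import time, so every call is recorded without touching application code. This sketch patches a stand-in class to show the idea; it is not whyllm's internals, which patch the real openai/anthropic SDKs.

```python
# Illustrative import-time instrumentation: replace a method with a
# wrapper that records latency and arguments, then delegates.
import functools
import time

captured = []

def instrument(cls, method_name):
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        start = time.perf_counter()
        result = original(self, *args, **kwargs)
        captured.append({
            "method": method_name,
            "latency_s": time.perf_counter() - start,
            "kwargs": kwargs,
        })
        return result

    setattr(cls, method_name, wrapper)

class FakeClient:                   # stand-in for a provider SDK client
    def complete(self, prompt):
        return f"echo: {prompt}"

instrument(FakeClient, "complete")  # done once, at startup
print(FakeClient().complete(prompt="hi"))  # app code is unchanged
print(len(captured))                       # 1 span captured
```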
Without whyllm

Your app feels slow. You blamed the database for a week. It was a 4-second LLM call.

With whyllm
  • TTFT, generation time, total latency tracked per model × route × user_id
  • P95 spike on any endpoint? Drill to exact calls in 2 clicks
  • Prompt length, model version, and timestamp all indexed
Without whyllm

Which feature is burning $3k/month? You have spreadsheets, guesses, and an angry CFO.

With whyllm
  • Tag calls with feature, user_id, session via SDK context headers
  • Filter spend by any dimension in the dashboard
  • /summarize = $0.0034/call × 8,200/day — know it before the CFO asks
Without whyllm

You shipped a new prompt. Engagement dropped. You can't tell if the prompt caused it.

With whyllm
  • SHA-256 content hash stored per system prompt version
  • Quality score delta auto-computed across versions
  • Regression surfaced same deploy — not same week
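Content-hash versioning is simple to picture: a SHA-256 over the system prompt text identifies each version, so quality scores can be grouped per version and deltas computed across deploys. The 12-character truncation below is an illustrative choice, not whyllm's storage format.

```python
# Sketch: any edit to the system prompt yields a new version id, with
# no manual version bookkeeping required.
import hashlib

def prompt_version(system_prompt: str) -> str:
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:12]

v1 = prompt_version("You are a helpful assistant.")
v2 = prompt_version("You are a terse, helpful assistant.")
print(v1 != v2)  # True: a one-word edit is a new version
print(len(v1))   # 12
```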
Without whyllm

Some calls send 50k tokens of context. Most only need 500. You're paying 100× too much.

With whyllm
  • Histogram: prompt_tokens vs completion_tokens per endpoint
  • P99 context size + estimated monthly waste in dollars
  • Alerts when GPT-5.4 crosses 272K surcharge boundary (rate doubles)
Without whyllm

Legal asks for every prompt that touched customer PII last quarter. Your answer: silence.

With whyllm
  • Append-only immutable span log — full prompt and response bodies
  • Filter by date, model, user_id, or regex pattern
  • CSV export or REST API — satisfies SOC2 and GDPR
Without whyllm

You're hitting rate limits in prod. You find out when users see 500 errors at 2am.

With whyllm
  • RPM tracked vs your tier limit in real time
  • PagerDuty/Slack alert fires at 80% utilisation
  • Auto-fallback to secondary API key or queued retry — zero user impact
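Rate-limit headroom tracking reduces to counting requests in a sliding one-minute window and firing an alert at 80% of the tier limit. In this sketch the PagerDuty/Slack hook is stubbed as a boolean; class names are illustrative.

```python
# Hypothetical RPM tracker: timestamps older than 60s fall out of the
# window; the alert fires once the window reaches 80% of the limit.
import collections

class RpmTracker:
    def __init__(self, tier_limit: int, alert_at: float = 0.8):
        self.limit = tier_limit
        self.alert_at = alert_at
        self.window = collections.deque()  # request timestamps (seconds)

    def record(self, now: float) -> bool:
        """Record a request; return True if the alert should fire."""
        self.window.append(now)
        while self.window and self.window[0] <= now - 60:
            self.window.popleft()
        return len(self.window) >= self.limit * self.alert_at

tracker = RpmTracker(tier_limit=10)
fired = [tracker.record(t) for t in range(10)]
print(fired.index(True))  # 7: alert fires on the 8th request in the window
```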
Without whyllm

You're running GPT-5.4 and Claude side by side but have no data on which performs better.

With whyllm
  • Traffic-split at proxy: GPT-5.4 ($2.50/M) vs Claude Sonnet 4.6 ($3/M) vs Gemini 2.5 Pro ($1.25/M)
  • Compare cost_per_call, p95_latency, hallucination_rate with statistical significance
  • Switch winner with one config line
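"Statistical significance" for a metric like hallucination rate typically means a two-proportion z-test over flagged counts from each arm of the traffic split. The test below is standard statistics, not whyllm-specific, and the counts are invented for illustration.

```python
# Two-proportion z-test: is model A's flag rate significantly different
# from model B's? |z| > 1.96 corresponds to p < 0.05 (two-sided).
import math

def two_proportion_z(flagged_a, n_a, flagged_b, n_b):
    p_a, p_b = flagged_a / n_a, flagged_b / n_b
    p = (flagged_a + flagged_b) / (n_a + n_b)       # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Invented example: 2.3% vs 1.15% hallucination rate over 2,000 calls each.
z = two_proportion_z(flagged_a=46, n_a=2000, flagged_b=23, n_b=2000)
print(abs(z) > 1.96)  # True: the difference is significant at the 95% level
```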

What you get

Three things no other tool
does well together

01

Full-stack tracing

Every LLM call captured — prompt, response, model, token count, latency. Filter by user, feature, or environment. Search your entire history in milliseconds.

OpenAI GPT-5.4 · Anthropic Claude Sonnet 4.6 · Gemini 2.5 Pro · AWS Bedrock · Azure OpenAI · any OpenAI-compatible endpoint
02

Cost control

Not just dashboards — actual enforcement. Set budgets per project, user, or API key. Auto-route to a cheaper model when a threshold hits. Kill switches included.

Real-time spend alerts · Auto-routing · Hard limits · Per-user budgets
03

Hallucination detection

Fast heuristics score every response for confidence, factual consistency, and refusal patterns. LLM-as-judge only fires on flagged spans — keeps cost near zero.

Sub-1ms heuristic pass · Sampled LLM judge · Confidence scores · Trend view
vs the competition
Tool
2-min setup
Cost control
Hallucination detection
Open source
Helicone
LangSmith
Langfuse
Arize
whyllm (you)

The dashboard

Everything in one place

app.whyllm.io/dashboard
my-app
Overview
Traces
Cost
Quality
Alerts
Playground
Monthly budget
$127 / $300
↓ 12% vs last month

Overview

Last 30 days • Updated just now

Live
Total spend
$127.40
↓ 12%
API calls
48,291
↑ 8%
Avg latency
892ms
↓ 34ms
Hallucination rate
2.3%
↓ 0.8%
Daily API spend · 30d
whyllm enabled
Model               Tokens   Cost     Latency   Score
gpt-5.4             1,847    $0.005   1.2s      ✓ 98%
claude-sonnet-4-6   2,103    $0.009   0.9s      ⚠ 72%
gpt-5.4-mini        934      $0.001   0.4s      ✓ 95%

Pricing

Simple. Usage-based.
No per-seat nonsense.

Pay for what you trace. A 10-person team shouldn't cost 10×.

Hobby
$0 forever

For solo devs and side projects

  • 50k spans / month
  • 7-day retention
  • Cost dashboard
  • 1 project
  • Community support
Start free
Most popular
Pro
$0.10 per 10k spans

For teams shipping LLMs in production

  • Unlimited spans
  • 90-day retention
  • Budget enforcement
  • Hallucination scoring
  • Alerts & webhooks
  • Unlimited projects
  • Email support
Start Pro →
Enterprise
Custom

For orgs with scale and compliance needs

  • Everything in Pro
  • SSO / SAML
  • Custom retention
  • Self-hosted option
  • SLA guarantee
  • Dedicated support
Talk to us
Free forever — no credit card required

Start monitoring in
2 minutes.

The engineers who wait find out about problems from their users.
The ones who ship win.

Get started free →
pip install whyllm