Your OpenAI bill is not lying to you. It shows you exactly what you spent. What it doesn't show you is why — which requests were wasteful, which of your features is sending GPT-5 calls it doesn't need, and how much you'd save by making three specific changes this week.

This guide breaks down the real per-request cost of every major LLM API in 2026, the math behind why teams routinely overspend by 40–60%, and the five changes that bring costs back in line.

Quick note on pricing

LLM API prices change frequently. All figures in this article reflect published rates as of March 2026. Check each provider's pricing page before making decisions. Prices are shown per 1 million tokens unless noted.

The Complete LLM API Pricing Table (2026)

Every major hosted model, ranked by input cost. The ratio between cheapest and most expensive is 150x — which means model choice is the single biggest lever in your AI budget.

| Model | Provider | Input / 1M tokens | Output / 1M tokens | Context |
|---|---|---|---|---|
| Gemini 2.0 Flash | Google | $0.10 | $0.40 | 1M |
| GPT-4o mini | OpenAI | $0.15 | $0.60 | 128K |
| GPT-5 Mini | OpenAI | $0.25 | $1.00 | 128K |
| Llama 3.3 70B (Groq) | Groq | $0.59 | $0.79 | 128K |
| Claude Haiku 4 | Anthropic | $0.80 | $4.00 | 200K |
| Gemini 2.0 Pro | Google | $1.25 | $5.00 | 1M |
| GPT-5 | OpenAI | $1.25 | $5.00 | 128K |
| Mistral Large | Mistral | $2.00 | $6.00 | 128K |
| GPT-4o | OpenAI | $2.50 | $10.00 | 128K |
| Claude Sonnet 4 | Anthropic | $3.00 | $15.00 | 200K |
| Claude Opus 4 | Anthropic | $15.00 | $75.00 | 200K |

The gap between Claude Opus ($15/1M input) and Gemini Flash ($0.10/1M) is 150x. That's not a rounding error — it's the difference between a $10K/month AI bill and a $67/month one, for the same number of requests.

What the Pricing Table Doesn't Tell You

The table above shows list prices. Here's what makes real-world costs higher — and harder to predict.

Output tokens cost 3–5x more than input tokens

This is the one that surprises teams most. GPT-5 charges $1.25/1M for input but $5.00/1M for output. If your app generates long responses — summaries, code, reports — output cost dominates your bill. A response that averages 800 output tokens costs 4x more to generate than a 200-token answer, even if the prompt is identical.

The fix: Audit your average output length by feature. Adding a max_tokens constraint to chatty features is the fastest cost reduction with the lowest risk.
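The savings from a cap are easy to estimate. A minimal sketch, using the GPT-5 output rate from the table above and hypothetical traffic figures:

```python
def monthly_output_cost(avg_output_tokens: int, daily_requests: int,
                        price_per_m_tokens: float = 5.00) -> float:
    """Estimated monthly output-token spend; default rate is GPT-5's $5.00/1M."""
    return avg_output_tokens * daily_requests * 30 * price_per_m_tokens / 1_000_000

# An uncapped 800-token average vs. a 300-token cap, at 10,000 requests/day:
uncapped = monthly_output_cost(800, 10_000)  # $1,200/month
capped = monthly_output_cost(300, 10_000)    # $450/month
```

The same function works for any model tier: swap in the output price from the table and your own request volume.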

Your system prompt runs on every single request

A 1,000-token system prompt at 50,000 daily requests costs $1,875/month at GPT-5 pricing, before a single token of actual user content.

Teams add instructions to system prompts without removing old ones. After six months, a 400-token system prompt has grown to 2,000 tokens. At 50,000 daily requests, those extra 1,600 tokens add $3,000/month at GPT-5 pricing, all from copy-pasted instructions nobody audited.
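System-prompt overhead is simple to estimate. A minimal sketch at GPT-5's input rate, using the 1,000-token prompt and 50,000-requests-per-day figures above:

```python
def monthly_prompt_cost(prompt_tokens: int, daily_requests: int,
                        input_price_per_m: float = 1.25) -> float:
    """Monthly cost of re-sending the system prompt on every request,
    priced at GPT-5's $1.25/1M input rate by default."""
    return prompt_tokens * daily_requests * 30 * input_price_per_m / 1_000_000

# A 1,000-token system prompt at 50,000 daily requests:
overhead = monthly_prompt_cost(1_000, 50_000)  # $1,875/month
```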

Caching cuts costs 60–95%, but isn't automatic

OpenAI's prompt caching discounts repeated prefixes by 50–75%. Anthropic's prompt caching offers up to 90% reduction on cached content. But you have to structure your prompts to take advantage of it — by keeping static content at the start of prompts where the cache can hit it.

Most teams send requests where the static system prompt is mixed with dynamic user content, defeating caching. Restructuring to [static_system_prompt] + [dynamic_user_message] is a one-hour engineering change that can cut 40–60% of your bill immediately.
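A minimal sketch of that restructuring (the prompt text and helper name are illustrative, not from any particular SDK): keep the unchanging block first so the provider's prefix cache can match it, and append only the dynamic part.

```python
# Static content lives in one constant, never interpolated per-request.
STATIC_SYSTEM_PROMPT = (
    "You are a support assistant for Acme. Follow the formatting rules below. [...]"
)

def build_messages(user_message: str) -> list[dict]:
    # Static prefix first: eligible for provider-side prompt caching.
    # Dynamic content last: only this part misses the cache.
    return [
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},
        {"role": "user", "content": user_message},
    ]
```

The anti-pattern this replaces is interpolating user data (names, timestamps, retrieved documents) into the system prompt itself, which changes the prefix on every request and guarantees a cache miss.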

Real-World Cost Scenarios

Abstract per-token prices are hard to reason about. Here's what different workloads actually cost across model tiers, at 10,000 requests/day (assuming 500 input + 300 output tokens per request).

Customer support chatbot

Claude Opus 4: $9,000/mo
Claude Sonnet 4: $1,800/mo
GPT-5: $638/mo
Claude Haiku 4: $480/mo
GPT-5 Mini: $128/mo

Document classification

GPT-4o: $1,275/mo
GPT-5: $638/mo
GPT-5 Mini: $128/mo
GPT-4o mini: $77/mo
Gemini 2.0 Flash: $51/mo

Code review assistant

Claude Opus 4: $9,000/mo
Claude Sonnet 4: $1,800/mo
GPT-5: $638/mo
GPT-5 Mini: $128/mo

Embedding generation (input tokens only)

text-embedding-3-large ($0.13/1M): $19.50/mo
text-embedding-3-small ($0.02/1M): $3/mo
Self-hosted open-source models can run lower still, at the cost of infrastructure overhead.

The document classification example is the most instructive: GPT-4o at $1,275/month vs. Gemini 2.0 Flash at $51/month is a 25x cost difference for the same task volume. If accuracy is equivalent (and for simple classification, it usually is), that's roughly $14,700/year left on the table.
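Under the stated per-request assumption (500 input + 300 output tokens, 10,000 requests/day), any scenario above reduces to a one-line formula. A minimal sketch:

```python
def monthly_cost(input_price: float, output_price: float,
                 daily_requests: int = 10_000,
                 in_tokens: int = 500, out_tokens: int = 300,
                 days: int = 30) -> float:
    """Monthly spend in dollars; prices are $ per 1M tokens."""
    per_request = (in_tokens * input_price + out_tokens * output_price) / 1_000_000
    return per_request * daily_requests * days

sonnet = monthly_cost(3.00, 15.00)  # Claude Sonnet 4: $1,800/month
haiku = monthly_cost(0.80, 4.00)    # Claude Haiku 4: $480/month
```

Plug in your own token averages from production logs; the defaults here are the article's worked assumptions, not universal values.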

Why Teams Overspend (The Real Reasons)

It's not that engineers don't care about cost. It's that the feedback loop is broken. Here's the pattern we see repeatedly:

  1. Feature built on GPT-4o during prototyping — the smart choice when you don't know what accuracy you need yet.
  2. Feature ships. Nobody revisits the model choice. The engineering team is onto the next sprint.
  3. Usage grows. The cost grows proportionally, but nobody has a breakdown by feature to see that Feature X is responsible for 60% of the bill.
  4. Bill becomes a line item on someone's P&L. The Slack message goes out: "Be mindful of LLM usage." Nobody knows what to cut.

The problem isn't the decision to use GPT-4o. It's the lack of a feedback loop that would tell you when GPT-5 Mini handles the task at equivalent quality, at roughly a tenth of the cost.

5 Ways to Cut Your LLM Bill Right Now

1. Match model to task complexity

Most production apps have a mix of tasks: some require deep reasoning (use the frontier model), most do not (use the mini tier). The rule of thumb: if a task can be described in a sentence and the correct output is unambiguous, a mini model handles it. Run both models on 100 real examples from your production logs and compare outputs before you change anything.
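That comparison step can be a tiny harness. A minimal sketch, where `call_frontier` and `call_mini` stand in for your own API wrappers (hypothetical names, not a real SDK), and `judge` defaults to exact string match:

```python
def agreement_rate(examples, call_frontier, call_mini,
                   judge=lambda a, b: a.strip() == b.strip()):
    """Fraction of examples where the mini model's output matches the
    frontier model's. Swap `judge` for a task-specific check (e.g. same
    extracted label) when exact match is too strict."""
    matches = sum(judge(call_frontier(x), call_mini(x)) for x in examples)
    return matches / len(examples)
```

If the rate on 100 real production examples is high enough for your task, the downgrade is safe; if not, you've spent a few dollars to avoid a quality regression.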

2. Cache repeated prompts

If you send the same prompt — or prompts with the same prefix — multiple times, you're paying full price each time. Structure prompts so static content comes first, enabling provider-side caching. For identical requests (same user asking the same thing), add a semantic similarity check before the API call. Even a 30% cache hit rate cuts your bill by nearly a third.
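The identical-request case needs nothing fancy. A minimal exact-match sketch (a simplified stand-in for the semantic-similarity check, which would compare embeddings rather than hashes):

```python
import hashlib

class ResponseCache:
    """Checked before the API call; identical (model, prompt) pairs are free."""

    def __init__(self):
        self._store = {}

    def _key(self, model: str, prompt: str) -> str:
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def get(self, model: str, prompt: str):
        # Returns the cached response, or None on a miss.
        return self._store.get(self._key(model, prompt))

    def put(self, model: str, prompt: str, response: str) -> None:
        self._store[self._key(model, prompt)] = response
```

In production you'd add a TTL and key on the normalized prompt, but even this shape captures the repeated-question traffic that chat products see constantly.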

3. Audit and trim system prompts

Open your most-used prompt. Count the tokens. Now calculate: tokens × daily requests × 30 × price per token. That number will motivate a prompt audit faster than anything else. Remove redundant instructions, consolidate formatting rules, and cut examples you added "just in case."

4. Set hard budget limits, not Slack alerts

Slack messages change behavior for one sprint. Hard limits prevent surprise bills entirely. Set workspace-level and feature-level budget caps at your proxy layer — so when a new feature starts burning 10x more than expected, it gets flagged before it hits your invoice.
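The enforcement logic itself is small. A minimal sketch of a per-feature monthly cap (an assumed design; real enforcement belongs in your gateway or proxy, not application code):

```python
from collections import defaultdict

class BudgetGuard:
    """Per-feature monthly spend caps, checked before each request."""

    def __init__(self, monthly_caps: dict):
        self.caps = monthly_caps          # e.g. {"summarizer": 500.0}
        self.spend = defaultdict(float)   # running spend this month

    def allow(self, feature: str) -> bool:
        # Features without a configured cap are never blocked.
        cap = self.caps.get(feature)
        return cap is None or self.spend[feature] < cap

    def record(self, feature: str, cost: float) -> None:
        self.spend[feature] += cost
```

The point is the check happens before the call, not in a dashboard after the invoice arrives.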

5. Track cost by feature, not just total

Your OpenAI dashboard shows total spend. That's like knowing your team's total salary budget without knowing which team members are allocated to which projects. Tag every LLM request with the feature that triggered it. Once you can see that Feature A costs $3,200/month and Feature B costs $180/month, the optimization decisions become obvious.
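The rollup is a few lines once requests carry a feature tag. A minimal sketch (log shape is illustrative):

```python
from collections import defaultdict

def cost_by_feature(request_log) -> dict:
    """Aggregate tagged requests into per-feature spend, highest first.
    request_log: iterable of {"feature": str, "cost": float} dicts."""
    totals = defaultdict(float)
    for req in request_log:
        totals[req["feature"]] += req["cost"]
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))
```

Run it over a day of logs and the top entry is usually a surprise, which is exactly the point.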

The uncomfortable truth

Most teams can cut 40–60% of their LLM bill without touching their product quality. The savings are sitting in model choices nobody revisited, system prompts nobody audited, and duplicate requests nobody cached. The bottleneck isn't knowing what to do — it's knowing where to look.

How to Find the Savings in Your Own Traffic

The tactics above are straightforward. The hard part is applying them to your specific codebase, where you have 40+ touchpoints calling the API across a dozen features, and no per-feature cost breakdown.

This is the problem Preto.ai solves. It sits between your app and your LLM provider as a transparent proxy — one URL change, no SDK, no refactor. Every request gets logged with cost, model, latency, and which feature triggered it. Within 24 hours, you see ranked recommendations: "Switch Feature X from GPT-5 to GPT-5 Mini — $1,240/month projected savings. 97% accuracy match on your last 2,300 requests."

Then it tracks whether you actually captured the saving.

Frequently Asked Questions

Which LLM API is cheapest in 2026?
For raw price, Gemini 2.0 Flash ($0.10/1M input) and GPT-4o mini ($0.15/1M) are among the cheapest hosted options. Self-hosted Llama 3 can be cheaper at scale but adds infrastructure overhead. The cheapest API depends on your task: mini models handle classification and extraction well, while complex reasoning may need a frontier model. Always benchmark accuracy before cutting cost.
How much does GPT-5 cost per request?
GPT-5 costs $1.25/1M input tokens and $5.00/1M output tokens. A typical request with 500 input tokens and 300 output tokens costs roughly $0.0021. At 10,000 requests/day, that's about $638/month, versus roughly $128/month for the same volume on GPT-5 Mini.
Why is my OpenAI bill higher than expected?
The three most common reasons: (1) Output tokens cost 3–5x more than input, and long responses add up fast. (2) System prompts repeat on every request — a 1,000-token system prompt at 50K daily requests costs $1,875/month at GPT-5 pricing before any actual user content. (3) Using a frontier model for tasks a mini model handles equally well.
Is Claude cheaper than GPT-5?
It depends on the tier. Claude Haiku 4 ($0.80/1M input) is more expensive than GPT-5 Mini ($0.25/1M) but cheaper than GPT-5 ($1.25/1M). Claude Sonnet 4 ($3.00/1M) and Opus 4 ($15.00/1M) are significantly more expensive than any GPT-5 tier. The right model depends on task requirements, not just headline price.
How can I reduce my LLM API costs without hurting quality?
The highest-impact changes: (1) Audit which features use which models — most apps use frontier models everywhere when only a few features need it. (2) Restructure prompts to enable provider caching on static content. (3) Trim system prompts — every 100 tokens removed from a 50K/day prompt saves ~$188/month at GPT-5 pricing. (4) Set hard budget limits at the proxy layer, not just Slack alerts. (5) Track cost by feature so you know where to look.

Find out where your AI budget is actually going.

Preto.ai tracks every LLM request by feature, ranks your top cost-cutting opportunities, and tells you exactly how much each change saves. One URL change to set up.

Start Free — 10K Requests Included →

No credit card. No SDK. Works with OpenAI, Anthropic, and NVIDIA.