May 3, 2026
The Cost-Latency-Quality Triangle for LLM Features
Every production LLM feature picks two of three: cheap, fast, or good. Most teams pretend they're picking all three, then discover at scale that they picked one. A practitioner's framework for making the trade-off explicit, with the numbers that actually drive the decision.
Every production LLM feature lives inside a triangle. The three corners are cost per inference, latency to first token (or to completion), and quality of the output for the actual use case. The triangle is real, the trade-offs are real, and the teams that ship LLM features successfully are the ones that decide which corner to give up before they start building, not after they get the bill or the customer complaint.
Most teams pick implicitly, by accident, by accepting whatever the easiest path produced. A founder ships a feature using the most capable available model at default temperature with no caching, the demo works, the launch happens, and three weeks later there’s a meeting about the API bill or the user complaints about response time. At that point the team is reverse-engineering trade-offs they should have made up front, under pressure, with a feature already in production users’ hands.
This is a framework for making the trade-off explicit. It’s also a working set of numbers that, while they will move as the model market moves, are the right shape of numbers to reason with as of mid-2026.
The three corners
Cost is the per-inference dollar figure: input tokens × input price + output tokens × output price. For a typical Claude or GPT-class feature with ~1500 input tokens and ~500 output tokens at frontier-model rates, cost per inference is currently in the $0.005–$0.05 range depending on model.[1][2] Multiply by your monthly inference count to get the actual line item. If your feature processes 1M user requests per month, the gap between a frontier model and a smaller one is the difference between a $50,000/month line item and a $5,000/month line item — at the same usage. Cost matters once you have meaningful volume; until then it’s noise.
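To make the arithmetic concrete, here is a minimal sketch; the per-million-token rates are illustrative placeholders drawn from the table below, not a quote of any vendor’s current price list:

```python
def cost_per_inference(input_tokens: int, output_tokens: int,
                       input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Dollar cost of one call: token counts times per-million-token rates."""
    return (input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok) / 1_000_000

# Illustrative rates only (roughly the frontier and small rows from the table below).
frontier = cost_per_inference(1500, 500, input_price_per_mtok=15.00, output_price_per_mtok=75.00)
small = cost_per_inference(1500, 500, input_price_per_mtok=0.25, output_price_per_mtok=1.25)

monthly_requests = 1_000_000
print(f"frontier: ${frontier:.4f}/call -> ${frontier * monthly_requests:,.0f}/month")
print(f"small:    ${small:.4f}/call -> ${small * monthly_requests:,.0f}/month")
```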
Latency is time-to-first-token (TTFT) and time-to-completion (TTC). For interactive features (chat, in-app assistance, real-time triage), TTFT is what your users perceive — under ~500ms feels responsive, ~1s feels like the model is thinking, >2s feels broken.[3] For background features (batch summarization, async classification, overnight enrichment), TTC is what matters and it’s typically in the 5–60s range for a single response. Streaming improves perceived latency for interactive features by showing tokens as they arrive, but it doesn’t shorten time-to-completion.
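A small instrumentation sketch for the two numbers; `fake_stream` is a hypothetical stand-in for whatever streaming client you actually call:

```python
import time
from typing import Iterable, Iterator

def timed_stream(tokens: Iterable[str]) -> tuple[str, float, float]:
    """Consume a token stream, returning (text, TTFT seconds, TTC seconds)."""
    start = time.monotonic()
    ttft = None
    chunks: list[str] = []
    for tok in tokens:
        if ttft is None:
            ttft = time.monotonic() - start   # first token arrived
        chunks.append(tok)
    ttc = time.monotonic() - start            # full response complete
    return "".join(chunks), (ttft if ttft is not None else ttc), ttc

def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for a real streaming API response."""
    for tok in ["Support ", "ticket ", "classified: ", "billing."]:
        time.sleep(0.1)
        yield tok

text, ttft, ttc = timed_stream(fake_stream())
print(f"TTFT={ttft * 1000:.0f}ms  TTC={ttc * 1000:.0f}ms  len={len(text)}")
```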
Quality is the hardest to measure. It’s whatever your eval harness scores against your test set — the rate at which the model produces outputs that pass your scorers.[4] Quality is use-case-specific: a model that scores 95% on your customer-support classifier may score 60% on your contract-clause extractor. There is no universal quality number; there is only quality for your test set.
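As a minimal illustration of “quality for your test set” — a pass rate over case-specific scorers, where `call_model` and the scorers are hypothetical stand-ins for your own harness:

```python
from typing import Callable

# Each case pairs an input with a scorer that decides whether the output passes.
# call_model and these scorers are hypothetical stand-ins for your own harness.
TestCase = tuple[str, Callable[[str], bool]]

def pass_rate(cases: list[TestCase], call_model: Callable[[str], str]) -> float:
    """Fraction of test cases whose model output passes that case's scorer."""
    passed = sum(1 for prompt, scorer in cases if scorer(call_model(prompt)))
    return passed / len(cases)

cases: list[TestCase] = [
    ("Classify: 'my card was charged twice'", lambda out: "billing" in out.lower()),
    ("Extract the due date from 'net 30, invoiced 2026-05-01'", lambda out: "2026-05-31" in out),
]

# Trivial stand-in model so the sketch runs end to end; swap in a real call.
print(pass_rate(cases, call_model=lambda prompt: "billing, due 2026-05-31"))
```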
Pick two
You cannot maximize all three. Every architectural choice for an LLM feature trades one corner against the other two. The fast way to internalize this is to look at the four most common trades:
Pick fast + cheap (give up quality). Use a smaller, faster model — Haiku-class, GPT-4-mini-class, an open-weight model self-hosted. Skip retrieval. Don’t reason. You’ll get sub-second latency and per-inference cost a tenth of frontier rates. Quality will drop on tasks that require nuance: the model will get the obvious cases right and miss the subtle ones. This is the right trade for high-volume classification, simple extraction, and triage where wrong-but-cheap is recoverable.
Pick fast + good (give up cheap). Use a frontier model with optimized inputs — short prompts, no preamble, structured output via tool-use rather than free-form generation. Use prompt caching aggressively to avoid re-paying for the same system prompt on every call.[5] You’ll get strong quality and acceptable latency, and you’ll pay for it linearly with usage. This is the right trade for high-stakes interactive features where wrong answers are expensive: customer-support responses with policy implications, medical or legal draft generation, agent tool-use decisions.
Pick cheap + good (give up fast). Use a frontier model but in batch or async mode. Run inferences during off-peak hours, take advantage of batch-API discounts (typically 50% on both Anthropic’s and OpenAI’s batch tiers),[1][2] and accept that responses are minutes-to-hours, not milliseconds. This is the right trade for offline enrichment, overnight summarization, and any pipeline where the user doesn’t sit waiting for output.
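A sketch of what the batch path can look like with Anthropic’s Message Batches API; the model name and prompts are placeholders, and the call shapes follow the published interface as I understand it, so verify against the current docs before relying on them:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

docs = {"doc-001": "First document text...", "doc-002": "Second document text..."}

# Submit every document as one batch; results come back asynchronously
# (minutes to hours), at the discounted batch rate rather than on-demand pricing.
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": doc_id,
            "params": {
                "model": "claude-sonnet-4-5",   # placeholder model name
                "max_tokens": 512,
                "messages": [
                    {"role": "user", "content": f"Summarize in 3 bullets:\n\n{text}"}
                ],
            },
        }
        for doc_id, text in docs.items()
    ]
)
print(batch.id, batch.processing_status)  # poll later; fetch results once processing ends
```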
Pick all three (you can’t, but you can approximate). The closest you’ll come is a routed architecture: a small fast model triages the request and decides which path applies (answer directly, escalate to a frontier model, or defer to batch), and only the genuinely hard subset hits the expensive path. This is more architecture than feature work, and it requires the eval harness to validate the router’s accuracy. You’re not getting all three; you’re paying once-up-front in engineering work to narrow the set of requests that have to make the trade-off.
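A sketch of the routing idea; the triage heuristic, handlers, and labels here are all hypothetical placeholders — in production the triage step is a cheap model call whose accuracy the eval harness validates:

```python
from typing import Callable

# Hypothetical handlers standing in for the three paths; in production each would
# call a real model or enqueue work, not return canned strings.
def answer_with_small_model(req: str) -> str:
    return f"[small-model answer to: {req!r}]"

def answer_with_frontier_model(req: str) -> str:
    return f"[frontier-model answer to: {req!r}]"

def enqueue_for_batch(req: str) -> str:
    return f"[queued for overnight batch: {req!r}]"

def triage(req: str) -> str:
    """Stand-in for the small, fast triage model."""
    if "report" in req.lower():
        return "offline_ok"
    return "simple" if len(req) < 80 else "hard"

ROUTES: dict[str, Callable[[str], str]] = {
    "simple": answer_with_small_model,
    "hard": answer_with_frontier_model,
    "offline_ok": enqueue_for_batch,
}

def handle(req: str) -> str:
    label = triage(req)                                        # cheap call on every request
    handler = ROUTES.get(label, answer_with_frontier_model)   # unknown labels fail safe to frontier
    return handler(req)

print(handle("What's my order status?"))
print(handle("Generate the quarterly usage report for all accounts"))
```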
The numbers, mid-2026
For grounding, here are the rough numbers to reason with for the model classes shipping today. These will move; the relationships are what matter.
| Model class | Input $/M tokens | Output $/M tokens | TTFT (typical) | Best for |
|---|---|---|---|---|
| Frontier (Opus, GPT-5-class) | $15 | $75 | 800–1500ms | High-stakes reasoning, agent tool-use, long-context |
| Mid (Sonnet, GPT-4-class) | $3 | $15 | 400–800ms | Most production user-facing features |
| Small (Haiku, GPT-4-mini) | $0.25–$1 | $1.25–$5 | 200–500ms | Classification, triage, high-volume extraction |
| Self-hosted (Llama, Mistral, Qwen) | (compute) | (compute) | varies | Volume features where data residency or unit economics dominate |
Source rates from current Anthropic and OpenAI public pricing pages.[1][2] Self-hosted economics depend on infrastructure choices that vary widely.
The two reductions every team should know:
- Prompt caching can drop input cost on a long stable system prompt by ~90% on subsequent calls, with no quality impact.[5] If your feature uses a 2000-token system prompt and you call it 100,000 times a month, the hour of work it takes to set up caching pays for itself almost immediately.
- Structured outputs via tool-use typically reduce output token count by 30–60% versus free-form natural language responses, because you’re forcing the model into a JSON schema rather than letting it generate prose. Lower output tokens, lower cost, faster completion, and easier validation downstream. Both reductions are sketched below.
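A sketch of both reductions together, using the Anthropic Python SDK as I understand its public interface; the system prompt, tool schema, and model name are placeholders, so check the current docs before relying on the exact call shapes:

```python
import anthropic

client = anthropic.Anthropic()

LONG_SYSTEM_PROMPT = "You are a support-ticket classifier. ... (imagine a ~2000-token policy doc here)"

response = client.messages.create(
    model="claude-sonnet-4-5",   # placeholder model name
    max_tokens=256,
    # Reduction 1: mark the long, stable system prompt as cacheable so repeat
    # calls read it from the cache instead of re-paying full input rates.
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    # Reduction 2: force structured output through a tool schema instead of prose,
    # which cuts output tokens and gives you a validatable JSON object.
    tools=[
        {
            "name": "record_classification",
            "description": "Record the ticket classification.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "category": {"type": "string", "enum": ["billing", "bug", "account", "feature", "other"]},
                    "urgency": {"type": "string", "enum": ["low", "medium", "high"]},
                },
                "required": ["category", "urgency"],
            },
        }
    ],
    tool_choice={"type": "tool", "name": "record_classification"},
    messages=[{"role": "user", "content": "Ticket: 'I was charged twice this month.'"}],
)

tool_use = next(block for block in response.content if block.type == "tool_use")
print(tool_use.input)  # e.g. {"category": "billing", "urgency": "medium"}
```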
What goes wrong when teams don’t pick
The failure modes are predictable. Three patterns I see consistently:
The “we’ll optimize later” trap. The team ships with the most capable model at default settings, traffic ramps, and the cost line item gets noticed by finance. Now they’re trying to migrate to a smaller model under pressure, with the feature already in production, no eval harness to validate quality preservation, and a customer base who’ll notice if quality drops. The migration takes 6 weeks and ships at slightly worse quality than the original — but they could have shipped at equivalent quality if they’d designed for it from week one.
The “frontier-everything” trap. The team uses a frontier model for tasks the model is overqualified for: extracting a date from a string, classifying a support ticket into one of five buckets, validating that JSON has the right shape. Each individual cost is small, but at volume the bill is dominated by tasks a smaller model would have done at 1/30th the cost with no perceptible quality difference. The fix is a routed architecture — but it’s much harder to retrofit than to build.
The “no eval harness” trap. The team has no way to measure whether moving from one model to another preserved quality, so every model migration is a leap of faith and tends not to happen. The team stays on whatever they shipped with, regardless of whether it’s the right corner of the triangle. The eval harness is what enables the cost-quality trade-off to be made deliberately rather than not at all.[4]
What “right” looks like
A team that’s picking deliberately has, before launch:
- Named the feature’s primary corner. Chat in a customer-facing app: fast + good, give up cheap. Overnight summarization of 100,000 documents: cheap + good, give up fast. Real-time fraud triage: fast + cheap, give up some quality. Write it down in the design doc, not in someone’s head.
- Measured the actual numbers. Run a representative test set through 2–3 candidate model+config combinations and produce a table with cost per inference, p50 / p95 latency, and eval harness score for each (a benchmarking sketch follows this list). Make the trade-off explicit before shipping.
- Set an alert on the corner you sacrificed. If you gave up cost, alert when the monthly bill projection crosses a threshold. If you gave up latency, alert when p95 TTFT exceeds your tolerance. If you gave up quality, alert when the nightly eval score drops below baseline. The corner you sacrificed is the one most likely to bite you; instrument it.
- Built the architecture to support the next trade-off. Today the feature might be fast+good. In 18 months, with 100x usage, it might need to become cheap+good (with batch fallback for non-interactive paths). Architectures that make the second trade-off cheap to execute — clean abstraction over the model layer, a router stub even if it’s a no-op, prompt caching from day one — preserve the optionality.
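A sketch of producing that per-candidate table; `call_candidate`, `cost_per_call`, and `scorer` are hypothetical stand-ins for your own model wrappers and whatever your eval harness already scores:

```python
import statistics
import time
from typing import Callable

def percentile(values: list[float], p: float) -> float:
    """Rough percentile for a small benchmark sample."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(p / 100 * len(ordered)))]

def benchmark(name: str,
              call_candidate: Callable[[str], str],        # hypothetical model+config wrapper
              cost_per_call: Callable[[str, str], float],  # (prompt, output) -> dollars
              scorer: Callable[[str, str], bool],          # (prompt, output) -> pass/fail
              test_set: list[str]) -> dict:
    latencies, costs, passes = [], [], 0
    for prompt in test_set:
        start = time.monotonic()
        output = call_candidate(prompt)
        latencies.append(time.monotonic() - start)
        costs.append(cost_per_call(prompt, output))
        passes += scorer(prompt, output)
    return {
        "candidate": name,
        "cost_per_inference": statistics.mean(costs),
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
        "eval_pass_rate": passes / len(test_set),
    }

# for name, fn in candidates:  # e.g. [("frontier", call_frontier), ("small", call_small)]
#     print(benchmark(name, fn, cost_per_call, scorer, test_set))
```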
This is unglamorous work. None of it makes a demo more impressive. All of it determines whether the feature is operating in production at acceptable economics 18 months in, or whether the team is in the conference room having the migration meeting.
Where to put the time
If you’re scoping an LLM feature right now and the corner-picking conversation hasn’t happened, that’s the conversation to have first — before the prompt, before the integration, before the eval. Most of the architecture decisions follow from it. The Foundation engagement at Mastascusa Holdings is two weeks of doing exactly this work end-to-end: stack selection per the corner you’re picking, eval harness against the resulting model, prompt caching and structured outputs configured, monitoring and cost alerts in place, deployment doc written so your team can extend the pattern to the next feature.
Talk to me about scoping if you want a senior pair of hands on the corner-picking conversation. Or read /build for the price and the deliverable.
Sources

1. Anthropic, Pricing — current published rates for Claude Opus 4.7, Sonnet 4.6, and Haiku 4.5 input and output tokens, plus batch and prompt-caching discounts. https://www.anthropic.com/pricing
2. OpenAI, Pricing — current published rates for GPT-5 and GPT-4-class models including batch-API discounts. https://openai.com/api/pricing/
3. Jakob Nielsen, Response Times: The 3 Important Limits — the canonical user-perception thresholds for interface responsiveness (~100ms instant, ~1s flow-preserving, ~10s attention loss), still used as the baseline for interactive system design. https://www.nngroup.com/articles/response-times-3-important-limits/
4. For what an eval harness actually contains and how it gates model changes, see “What an AI Eval Harness Actually Looks Like in Production” — companion piece to this article.
5. Anthropic, Prompt caching — official documentation explaining cache write/read pricing and how to structure long stable prompts to maximize cache hits. https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching