The AI Eval Budget: What Reliability Actually Costs
There is a number making the rounds in boardrooms right now that nobody wants on their slide: 95% of generative AI pilots show no measurable profit-and-loss impact. Not disappointing impact. No impact. MIT researchers put that number on the table and the reaction across corporate America has been a slow-motion flinch. Uber’s COO said publicly that AI costs were harder to justify than the company expected. One company reportedly burned $500 million on AI usage in a single month. JPMorgan published a research note with a title that reads like a horror film: “AI Token Costs are Eating Internet Profits Alive.”
The popular diagnosis is sticker shock. AI got expensive, companies overbought, and now the CFOs are clawing it back.
I think that diagnosis is wrong, and I think it is wrong in a way that matters for every founder building on top of these models.
The companies in pain are not overpaying for AI. They are paying, mostly without knowing it, for unreliability. The wasted tokens, the abandoned pilots, the support tickets generated by a confidently wrong answer, the engineer-week spent debugging a regression that shipped silently when a model version changed. None of that shows up as a line item called “AI failure.” It shows up smeared across payroll, churn, refunds, and rework, which is exactly why nobody budgets for it.
There is a name for the discipline that converts that smear into a number you can see and manage. Evals. And there is a reason eval infrastructure is suddenly one of the hottest categories in AI: one evaluation platform hit a $1.7 billion valuation less than four months after launch, and at least 20 venture-funded eval startups are now competing for your money.
I have built two companies where AI does most of the production work, and I skipped evals for longer than I should have in both. This post is the budget conversation I wish someone had forced on me. It is not about how to write evals; I covered the mechanics in the evals playbook for solo founders. This one is about the money: what unreliability actually costs, what evals actually cost, and how to size the budget so the first number stops eating your margin.
Table of Contents
- The Tax You Are Already Paying
- The Two Taxes
- The Eval Budget Curve
- Why the Tax Stays Invisible
- The Eval Spend Ladder
- Sizing Your Eval Budget by Stage
- Build vs Buy in the Eval Gold Rush
- Running the Ledger on Two Public Failures
- The Contrarian Take: Evals Are a Pricing Decision
- What to Do Monday Morning
- FAQ
The Tax You Are Already Paying
Start with the scale of the thing, because it is bigger than most founders’ mental model.
By 2025, researchers tracking the cost of AI hallucinations across enterprises estimated the damage at $67.4 billion for 2024 alone, with projections above $112 billion for the following year. The breakdown is instructive: roughly $18.2 billion in direct losses, $21.5 billion in operational cleanup, and $27.7 billion in reputational damage. The cleanup and reputation lines are bigger than the direct losses. The second-order costs of a wrong answer outweigh the first-order ones.
That is the aggregate. The per-company version is uglier because it is concrete.
Air Canada’s support chatbot invented a bereavement fare policy that did not exist. A customer relied on it, the airline refused to honor it, and a tribunal ruled that the company was responsible for what its own bot said. A Chevrolet dealership’s chatbot agreed to sell a Tahoe for one dollar after a customer talked it into the deal. Deloitte’s Australian arm refunded part of a AU$440,000 government contract after its AI-assisted report shipped with roughly 20 errors, including fabricated citations to sources that did not exist. A lawyer for a high-profile client filed a brief with almost 30 defective citations, misquotes, and references to fictional cases.
Single incidents, you might say. Outliers. Except the base rates say otherwise. In one survey, 47% of enterprise AI users admitted they had made at least one major business decision based on hallucinated content. Legal AI tools purpose-built for accuracy, with retrieval pipelines designed to ground every answer, still hallucinate between 17% and 34% of the time. And the structural numbers are worse: 88% of AI proofs-of-concept never reach production, 42% of US companies abandoned most of their AI initiatives in 2025, up from 17% a year earlier, and enterprise agentic systems show roughly a 37% gap between their benchmark scores and their real-world performance.
Here is the founder translation. Every one of those statistics is money that left a building. A failed pilot is salary plus tokens plus opportunity cost. A hallucinated answer that reaches a customer is a refund, a churn event, or a court date. A regression that ships silently is an engineer-week of forensic debugging. The money is real and it is already leaving. The only question is whether it leaves through a line item you can see or through a thousand small holes you cannot.
I wrote about why AI agents fail in production from the architecture side. This is the same failure surface viewed from the accounting side. And from the accounting side, the conclusion is blunt: you do not get to choose whether you pay for AI reliability. You only get to choose which form the payment takes.
The Two Taxes
Every AI product pays one of two taxes.
The first is the reliability tax. It is what you pay when you ship without measurement: failures discovered by customers, regressions discovered in production, quality discovered by anecdote. It has three properties that make it lethal. It is unbounded: there is no ceiling on what a single bad answer can cost you. Ask Air Canada. It is lagging: you learn the price weeks or months after the failure, when the churn shows up or the lawyer calls. And it compounds, because every untested change you ship stacks new failure modes on top of old ones.
The second is the eval budget. It is what you pay when you measure on purpose: a golden dataset that encodes what “correct” means for your product, a regression gate in CI, scored samples of production traffic. It has the opposite three properties. It is bounded: you decide the number in advance. It is leading: it tells you about failures before customers do. And it flattens, because the same eval suite that catches today’s regression keeps catching regressions on every future model swap.
That is the entire framework. Everything else in this post is arithmetic on top of it.
The reframe that matters: evals are not a cost you add to your AI budget. They are a conversion. You are converting an invisible, unbounded, lagging tax into a visible, bounded, leading one. Nobody describes insurance as wasted money because the house did not burn down this year. The eval budget is the only insurance product in AI that also makes the house less flammable.
If you have read my piece on cost-first AI product launches, this is the missing column from that ledger. COGS for an AI product is not just inference. It is inference plus the cost of being wrong, and only one of those two numbers is printed on an invoice.
The Eval Budget Curve
So how much should you spend? The honest answer is a curve, not a number.
Picture total reliability cost on the vertical axis: everything unreliability costs you plus everything you spend measuring it. On the horizontal axis, eval spend as a share of your total AI budget, from zero to half.
At zero, the reliability tax dominates. You are the 88% whose proof-of-concept dies before production, or worse, the company whose failure modes ship and meet customers. As eval spend rises, the reliability tax falls fast. Early eval dollars are the cheapest reliability you will ever buy: a 20-case golden set catches the embarrassing failures, a CI gate catches the regressions, scored production samples catch the drift.
But the curve does not fall forever. Past a point, each additional eval dollar catches a rarer, cheaper failure, while the spend itself keeps climbing and the process overhead starts taxing your shipping speed. Spend half your AI budget on measurement and you have a different disease: a bureaucracy of dashboards wrapped around a product nobody is improving.
Add the two curves and you get a U. The bottom of that U, for customer-facing AI products in 2026, sits in a surprisingly well-documented band: 25% to 35% of total AI project spend. Industry analyses of project-level eval budgets put the defensible range right there, and note something founders consistently miss: the fixed floor. Building an eval set, setting up labeling, and standing up the measurement loop together run $40,000 to $120,000 for an enterprise, whether the project is $200,000 or $2 million. Measurement has a minimum viable size.
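A toy model makes the shape concrete. Everything in it is an illustrative assumption, not data from the analyses above: the exponential decay for the reliability tax, the linear overhead term, and every constant are hypothetical, and the constants were picked so that the bottom of the U lands inside the 25-35% band.

```python
import math

def total_reliability_cost(eval_share, base_tax=100.0, decay=10.0, overhead=60.0):
    """Toy total cost at a given eval share (0.0 to 0.5) of the AI budget.

    All constants are hypothetical and chosen for illustration only.
    """
    # Reliability tax: falls fast with the first eval dollars, then flattens.
    reliability_tax = base_tax * math.exp(-decay * eval_share)
    # Eval spend plus process overhead: keeps climbing as the share grows.
    eval_cost = overhead * eval_share
    return reliability_tax + eval_cost

# Sweep eval share from 0% to 50% in 1% steps and find the bottom of the U.
shares = [i / 100 for i in range(0, 51)]
costs = {share: total_reliability_cost(share) for share in shares}
cheapest_share = min(costs, key=costs.get)
print(f"Toy minimum at {cheapest_share:.0%} eval share")
```

Change `decay` or `overhead` and the minimum moves, which is the point: the band is a property of how costly your failures are, not a constant.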
Two warnings about reading this curve.
First, the 25-35% band is calibrated for customer-facing products where wrong answers have legal, financial, or trust consequences. An internal summarization tool does not need a third of its budget in measurement. The band moves with blast radius, which I will get specific about in the sizing section.
Second, solo founders read “$40,000 fixed floor” and despair, and they should not. That floor is an enterprise number: labeling vendors, compliance reviews, dedicated tooling. The solo version of the same floor is measured in your evenings, not your dollars. The shape of the curve is identical at every scale. Only the units change.
Why the Tax Stays Invisible
If unreliability costs this much, why does almost nobody budget for it? The data contains the answer, and it is a beautiful piece of organizational self-deception.
LangChain surveyed over 1,300 teams building agents and found that 89% had implemented observability. Among teams with agents in production, 94% had some form of it and 71.5% had full tracing. Meanwhile evals? 52% overall. Online evaluation of live traffic, 37%. The same survey found that quality, not cost, not latency, was the number one barrier to deployment, cited by 32% of respondents.
Sit with that for a second. Nine in ten teams can watch their agent work. Half can tell you whether the work was any good. And the thing blocking them from shipping is the thing only half of them measure.
Observability and evals sound like the same category, which is exactly the trap. Observability tells you what happened: the trace, the latency, the token count, the tool calls. Evals tell you whether what happened was correct. A trace of a hallucination is a beautifully instrumented hallucination. You can watch the Air Canada bot invent the bereavement policy in perfect detail and your dashboard will show you a green checkmark, sub-second latency, and a satisfied-looking completion.
This is why the reliability tax survives in well-run companies. The dashboards are full. The graphs go up and to the right. Everyone has the warm feeling of being data-driven while the one number that matters, the rate at which the system is right, is not on any screen. The 37% gap between benchmark performance and production performance lives precisely in this blind spot: teams measure what the model scored on a public test and what it costs to run, and skip the middle question of what it gets wrong on their actual traffic. I covered why public benchmarks cannot answer that question in the benchmark contamination playbook: the public numbers are marketing, your private numbers are the product.
There is also a psychological layer, and I say this as someone who has run the experiment on myself. Building evals means writing down, in advance, what your system gets wrong. It is the same discomfort as the pre-mortem, and founders dodge it the same way. Shipping feels like progress. Measuring feels like doubt. So the measurement waits until the incident, and the incident always costs more than the measurement would have. The pattern is close enough to the calibration problem that the fix rhymes: you do not fix overconfidence with more confidence, and you do not fix unreliability with more features.
The Eval Spend Ladder
Eval spending is not one decision. It is a ladder, and each rung buys you a different class of caught failure. Here is the ladder as I build it, from free to serious.
Rung 1: vibe checks. You change a prompt, run five favorite examples, eyeball the output, ship. This is where roughly half the industry lives, the 48% who told LangChain they have no evals. It costs nothing and catches nothing repeatably, because the five examples you remember to check are by definition the five your system already handles. Your real failure modes live in the inputs you never think to type.
Rung 2: the golden set. Twenty to a hundred real cases with verified correct answers, run on every meaningful change. This is the single highest-return purchase on the ladder. It converts “I think the new prompt is better” into a number, and it catches the failure class that kills trust fastest: the case that used to work and quietly stopped. For a solo founder this costs evenings. For a team paying for labeling and review, low thousands. Random prompt sampling on top of it surfaces failures you did not curate for; the lean golden set stays your deterministic gate.
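A minimal sketch of what a rung 2 runner can look like, assuming answers short enough for exact-match scoring. The `GoldenCase` fields, the `fake_system` stand-in, and the sample cases are all hypothetical; a real product would swap in its own model call and probably a looser scorer.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GoldenCase:
    case_id: str
    prompt: str
    expected: str  # the verified correct answer

def normalize(text: str) -> str:
    """Ignore case and surrounding whitespace when comparing answers."""
    return text.strip().lower()

def run_golden_set(cases, system: Callable[[str], str]):
    """Return (pass_rate, failed_case_ids) using exact-match scoring."""
    failed = [
        case.case_id for case in cases
        if normalize(system(case.prompt)) != normalize(case.expected)
    ]
    pass_rate = 1 - len(failed) / len(cases)
    return pass_rate, failed

# Stand-in for the real model call; replace with your own client.
def fake_system(prompt: str) -> str:
    answers = {"Refund window?": "30 days", "Ship to Canada?": "Yes"}
    return answers.get(prompt, "I don't know")

cases = [
    GoldenCase("refund-1", "Refund window?", "30 days"),
    GoldenCase("ship-1", "Ship to Canada?", "yes"),
    GoldenCase("warranty-1", "Warranty length?", "12 months"),
]
pass_rate, failed = run_golden_set(cases, fake_system)
print(f"pass rate {pass_rate:.0%}, failed: {failed}")
```

The list of failed case IDs matters as much as the pass rate: it is what tells you which previously working case quietly stopped.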
Rung 3: the CI regression gate. The golden set, automated, blocking deploys. The difference between rung 2 and rung 3 is the difference between owning a smoke alarm and having it wired to something. This is also the rung that pays for itself on the day a provider deprecates your model, which they will, on roughly two weeks’ notice. Teams with a gate swap models in an afternoon and know exactly what changed. Teams without one find out from customers. I made the case in the model churn essay that your eval suite is the passport that makes you portable across providers. This rung is where the passport gets issued.
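One way to wire the alarm, sketched under assumptions: per-case results are stored as a mapping from case ID to pass/fail, the previous good run is kept as a baseline, and the `min_pass_rate` threshold of 0.9 is a hypothetical default rather than a recommendation.

```python
def regression_gate(current_results, baseline_results, min_pass_rate=0.9):
    """Compare per-case results (case_id -> passed) against the last good run.

    Returns (ok, regressions), where regressions are cases that passed in
    the baseline and fail (or are missing) in the current run.
    """
    regressions = sorted(
        case_id for case_id, passed in baseline_results.items()
        if passed and not current_results.get(case_id, False)
    )
    pass_rate = sum(current_results.values()) / len(current_results)
    ok = not regressions and pass_rate >= min_pass_rate
    return ok, regressions

def main(current_results, baseline_results):
    """Return a process exit code: 0 lets the deploy through, 1 blocks it."""
    ok, regressions = regression_gate(current_results, baseline_results)
    if not ok:
        print(f"Blocking deploy. Regressed cases: {regressions}")
        return 1
    print("Gate passed.")
    return 0

# In a CI job this would typically end with: sys.exit(main(current, baseline))
```

The gate blocks on any case that used to pass, even if the overall pass rate went up, because a net improvement can still hide a broken case a customer depends on.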
Rung 4: online evals. Scoring a sample of live production traffic, with LLM-as-judge for breadth and human review for the high-stakes slice. This is the rung that closes the 37% lab-to-production gap, because it is the only rung that measures the distribution your customers actually generate rather than the one you imagined. It is also the least-climbed rung in the industry: 37% adoption, against 89% for observability. Expensive, yes. But this is where you discover that your real users paste screenshots, write in three languages, and ask the one question your golden set never imagined.
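The sampling half of this rung can be sketched in a few lines. This assumes each production trace carries a stable ID and a flag marking the high-stakes slice; the `trace` dict shape, the 5% default rate, and the choice to send every high-stakes trace to a human are hypothetical.

```python
import hashlib

def in_eval_sample(trace_id: str, sample_rate: float) -> bool:
    """Deterministically pick a stable slice of traffic by hashing the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < sample_rate * 2**64

def route_for_review(trace, sample_rate=0.05):
    """Return 'human', 'judge', or 'skip' for one production trace.

    `trace` is a hypothetical dict with 'id' and 'high_stakes' keys.
    """
    if trace["high_stakes"]:
        return "human"  # the high-stakes slice always gets human review
    if in_eval_sample(trace["id"], sample_rate):
        return "judge"  # breadth comes from an LLM-as-judge scorer
    return "skip"
```

Hashing the ID instead of calling a random number generator means the same trace is always in or out of the sample, so a re-run of the scorer sees the same traffic.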
Rung 5: the closed loop. Every production failure becomes a test case. Failure found, corrected output written, appended to the golden set with synthetic variations, re-evaluated forever. This is the rung where the eval budget stops being insurance and starts being an asset, because the suite now encodes everything your product has ever gotten wrong, which is a dataset no competitor and no model vendor has. Practitioner guidance is consistent on the payback: every hour invested in evals saves dozens of hours of incident response, and the gap widens as the suite compounds.
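The append step of the loop might look like the sketch below. It assumes the golden set lives in a JSONL file in your repo; the record fields and the `append_failure` helper are hypothetical, and generating synthetic variations is left out.

```python
import json
import tempfile
from pathlib import Path

def append_failure(golden_path, case_id, prompt, corrected_output, source="production"):
    """Append one corrected production failure to a JSONL golden set.

    Returns False without writing if the case ID is already present.
    """
    path = Path(golden_path)
    existing_ids = set()
    if path.exists():
        with path.open(encoding="utf-8") as f:
            existing_ids = {json.loads(line)["case_id"] for line in f if line.strip()}
    if case_id in existing_ids:
        return False
    record = {"case_id": case_id, "prompt": prompt,
              "expected": corrected_output, "source": source}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return True

# Usage against a throwaway file: the second append of the same case is a no-op.
with tempfile.TemporaryDirectory() as tmp:
    golden = Path(tmp) / "golden.jsonl"
    first = append_failure(golden, "prod-0001", "Refund window?", "30 days")
    repeat = append_failure(golden, "prod-0001", "Refund window?", "30 days")
    line_count = len(golden.read_text(encoding="utf-8").splitlines())
```

Keeping the file append-only and in version control gives every failure a diff and a date, which is what turns the suite into a record rather than a scratchpad.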
Sizing Your Eval Budget by Stage
The 25-35% band is the destination, not the starting point. Here is how I would size the budget at each stage of an AI product’s life, with the blast-radius adjustment built in.
| Stage | Eval spend | What you build | What you skip |
|---|---|---|---|
| Prototype (no customers) | ~5%, mostly time | A 20-case golden set, run by hand. Rungs 1-2. | All tooling. All vendors. Anything automated. |
| First customers (failures cost trust) | 10-15% of AI budget | Golden set to 100 cases, CI gate blocking deploys. Rung 3. | Online evals, unless failures touch money or law. |
| Scale (failures cost revenue) | 25-35% of AI budget | Online evals on sampled traffic, human review on the high-stakes slice, closed loop. Rungs 4-5. | Nothing. At this stage the tax is bigger than the budget. |
Three adjustments to the table.
Blast radius moves the band. If your AI writes internal first drafts that a human always reviews, halve every number. If it talks to customers about money, medicine, or law with no human in the loop, you are not in the 25-35% band, you are above it, and the Air Canada tribunal is the reason why. The question is never “what does eval tooling cost,” it is “what does my worst plausible output cost.” Price the second number first.
The floor is real but it scales down. The enterprise fixed floor of $40,000 to $120,000 covers eval-set construction, labeling, and infrastructure. The solo-founder equivalent of that floor is roughly two focused weekends to build a first golden set and wire a gate, plus a few hundred dollars a month of judge-model tokens. What does not scale down is the principle: below a minimum investment, you do not have a smaller eval system, you have none. Twenty cases run religiously beats two hundred cases run once.
Count the spend against COGS, not R&D. This is an accounting choice with teeth. Booked as R&D, the eval budget is the first thing cut in a tight quarter, because R&D is “optional.” Booked as cost of goods sold, it is understood as what it actually is: part of what it costs to serve a correct answer. Restaurants do not classify food safety as research. If reliability is part of your product, and for an AI product it is most of your product, then measuring it is part of your unit economics. I gave evals a single line in the internal AI stack build order, Day 14 of 30. This essay is that line item unfolded into its own budget.
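The table and the halving rule reduce to a few lines of arithmetic. This is a sketch, not a calculator to trust blindly: the stage keys and function name are hypothetical, "~5%" is treated as exactly 5%, and the unreviewed high-stakes case is deliberately not modeled because the text only says it sits above the band.

```python
# Eval spend as a share of total AI budget, taken from the sizing table.
STAGE_BANDS = {
    "prototype": (0.05, 0.05),        # "~5%, mostly time"
    "first_customers": (0.10, 0.15),
    "scale": (0.25, 0.35),
}

def eval_budget_range(stage, ai_budget, human_reviews_every_output=False):
    """Return (low, high) eval spend for a stage, in the same units as ai_budget."""
    low, high = STAGE_BANDS[stage]
    if human_reviews_every_output:
        # Blast-radius adjustment: halve every number when a human always reviews.
        low, high = low / 2, high / 2
    return ai_budget * low, ai_budget * high

low, high = eval_budget_range("scale", 200_000)
print(f"Scale stage on a $200,000 AI budget: ${low:,.0f} to ${high:,.0f}")
```

At the scale stage, the band on a $200,000 budget works out to $50,000 to $70,000, which sits just above the low end of the enterprise fixed floor.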
Build vs Buy in the Eval Gold Rush
The eval category is having its gold-rush moment. One evaluation and benchmarking platform reached a $1.7 billion valuation in under four months. At least 20 venture-funded startups, Braintrust, Vellum, Patronus, Athina, Judgment Labs and more, are selling picks and shovels. Bessemer named evaluation and reliability one of the five frontiers of AI infrastructure for 2026, and if you have read my AI opportunity map, you will recognize the pattern: reliability infrastructure is where the durable money goes once the model layer commoditizes. When this much capital floods a tooling category, founders feel pressure to buy something to feel safe.
So let me draw the line that the vendor decks will not draw for you. The eval stack has two halves, and only one of them can be bought.
What you can buy: the plumbing. Trace storage, scoring orchestration, judge-model management, dashboards, labeling queues, regression diffing. This is genuinely undifferentiated work, the vendors do it better than you will, and prices are falling as 20 competitors fight for the same design partners. Buying plumbing at the “first customers” stage is usually right; building it is a two-week detour into infrastructure that ships zero product. The same logic I applied to the build-vs-buy line in the internal AI stack applies here without modification: buy what is standard, build what touches your business logic.
What you cannot buy: the judgment. Your golden set (the cases, the correct answers, the rubrics that define what “good” means for your product) is your business logic written down as test data. No vendor knows that a refund-policy answer in your product must cite the policy version, or that your users’ worst inputs arrive as screenshots of spreadsheets. The $1.7 billion platform can score your outputs against your rubric at impressive scale. It cannot write the rubric. The day you outsource the rubric is the day your quality bar becomes a vendor default.
And here is the part I find genuinely underpriced in the build-vs-buy conversation: the golden set is a moat. It is the accumulated record of every failure your product has survived, every edge case your real traffic has produced, every definition of correctness your customers have taught you. A competitor with the same model and the same vendor stack does not have it and cannot buy it. In a world where the models themselves churn every few months, the eval suite is one of the few assets that appreciates: every model swap it survives makes it more valuable. Vendors come and go from your stack. The rubric stays.
The practical split, by stage: prototype, buy nothing, a folder of cases and a script is the whole stack. First customers, buy plumbing if wiring the CI gate yourself would take more than a week, otherwise a script still wins. Scale, buy the platform, negotiate hard because you have 20 alternatives, and treat the golden set itself as code: versioned, reviewed, owned, in your repo, exportable in an afternoon.
Running the Ledger on Two Public Failures
Frameworks are cheap until you run them against receipts. Take the two-taxes ledger and apply it to two failures that are public enough to price.
Air Canada. The bot invented a bereavement fare policy. The direct cost, the tribunal award, was a few hundred dollars. The real bill: legal fees on a case the airline chose to fight and lose, a binding precedent that companies are liable for what their bots promise, and a news cycle that made “airline chatbot lies to grieving customer” a permanent search result. What would the eval budget have cost instead? Policy questions are a closed domain: the airline’s actual fare rules exist as documents. A golden set of policy questions with answers verified against those documents, plus a rule that policy answers must quote a retrieved source or refuse, sits on rung 2 to 3 of the ladder. Weeks of work, low thousands. The asymmetry is not subtle. The reliability tax billed at maybe a thousand times the eval budget that would have prevented it, and that is before the precedent, which the entire industry is still paying.
Deloitte Australia. A AU$440,000 government report shipped with roughly 20 fabricated citations and errors. Cost: a partial refund, a public correction, and a brand built on rigor wearing a headline about fabrication. The eval translation: citation verification is among the most automatable checks in the entire eval repertoire. Does the cited source exist? Does it say what the document claims? That is a rung 3 check, a deterministic script, not even a judgment call. The firm billed nearly half a million dollars for the report and could not afford the afternoon of verification tooling. Except of course it could. The budget was never the constraint. The line item simply did not exist.
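For a sense of how small that script can be, here is a sketch under stated assumptions: the cited sources are available as text keyed by an ID, and "says what the document claims" is approximated by a whitespace-insensitive substring match, which is a crude proxy. The data shapes and the sample policy text are hypothetical.

```python
def squash(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not cause false alarms."""
    return " ".join(text.split()).lower()

def verify_citations(citations, known_sources):
    """Flag citations whose source is missing or whose quote is not in the source.

    citations: list of dicts with 'source_id' and 'quote'.
    known_sources: dict mapping source_id to the full source text.
    """
    problems = []
    for citation in citations:
        source_text = known_sources.get(citation["source_id"])
        if source_text is None:
            problems.append((citation["source_id"], "source not found"))
        elif squash(citation["quote"]) not in squash(source_text):
            problems.append((citation["source_id"], "quote not in source"))
    return problems

sources = {"policy-v3": "Refunds are accepted within 30 days\nof purchase."}
citations = [
    {"source_id": "policy-v3", "quote": "within 30 days of purchase"},
    {"source_id": "policy-v3", "quote": "within 90 days"},
    {"source_id": "smith-2019", "quote": "anything at all"},
]
problems = verify_citations(citations, sources)
```

The same check, pointed at retrieved policy documents instead of a bibliography, covers the "quote the source or refuse" rule from the Air Canada case.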
Run the same exercise on your own product with the failure ledger below. The middle column is the one your accounting system is currently hiding from you.
| Failure mode | Where the tax hides in your P&L | The receipt |
|---|---|---|
| Confident wrong answer reaches a customer | Refunds, churn, legal, support load | Air Canada: liable in court for its bot’s invented policy |
| Silent regression after a model or prompt change | Engineering payroll (forensic debugging), missed roadmap | 37% average gap between benchmark and production performance |
| Fabricated facts in delivered work | Rework, refunds, reputation | Deloitte: partial refund on a AU$440K report, ~20 errors |
| Pilot that never ships | Salaries, tokens, opportunity cost, credibility for the next pilot | 88% of POCs never reach production; 42% of companies abandoned most AI initiatives in 2025 |
| Decisions made on hallucinated content | Wherever the decision lands. Unbounded. | 47% of enterprise AI users admit to at least one |
The Contrarian Take: Evals Are a Pricing Decision
The standard framing, even among people who take evals seriously, is that evals are a quality practice. Engineering hygiene. The AI version of unit tests. Useful framing, and it undersells the idea badly.
Evals are a pricing decision. Here is what I mean.
What you charge for an AI product is a function of what you can promise. What you can promise is a function of what you can measure. The product that can say “we verify every citation” or “policy answers quote the source or refuse” is selling a different thing than the product that says “powered by the latest model,” and it can charge like it. Reliability you cannot demonstrate is reliability you cannot price. The eval suite is not just catching bugs; it is manufacturing the evidence that justifies your invoice. In a year when fewer than 1% of companies report significant returns on AI spend, “measurably correct” is the scarcest feature in the market.
The flip side: the cost panic itself is a misread of the same ledger. The companies slashing AI budgets because “AI is too expensive” are mostly companies that never measured what the spend produced. They cannot distinguish the spend that worked from the spend that did not, because nothing was instrumented to tell them. So they cut indiscriminately, which is exactly as rational as the indiscriminate spending was a year earlier, and roughly as well-informed. Both moves come from the same place: no measurement. The 42% abandonment statistic is not evidence that AI does not work. It is evidence that unmeasured AI cannot prove it works, and what cannot prove it works gets cut when the CFO gets nervous.
Now the honest counterargument, because there is one. A founder pre-product-market-fit who spends 35% of anything on measurement is making a mistake. If nobody wants the product, an exquisitely measured version of it fails with better dashboards. Velocity genuinely is the right priority before the product matters, and there is a real failure mode, I have watched teams fall into it, where the eval suite becomes a procrastination engine: forty metrics, beautiful CI, no customers. The resolution is the blast-radius rule, not a slogan. Eval spend should track the cost of your worst plausible output. Pre-PMF with human review on every output, that cost is near zero, so the budget should be near zero too: a 20-case set and honesty. The day an output goes to a customer unreviewed, the worst-case cost jumps discontinuously, and the budget should jump with it. Most founders get the first half right and miss the jump. That miss is the whole reliability tax.
What to Do Monday Morning
Ninety minutes, four artifacts. No tooling purchases.
First 15 minutes: pull the real AI number. It has two parts: last month’s invoice (model API spend and subscriptions), and the hours anyone spent fixing, reviewing, or apologizing for AI output, priced at loaded cost. The second part is your first sighting of the reliability tax. Most founders have never put it next to the invoice. The ratio between the two is your before picture.
Next 30 minutes: write the failure ledger. Five rows, copy the table above. For each: what is the worst output my system plausibly produced this quarter, who saw it, what did it cost or what would it have cost if the wrong person had seen it. Be specific the way the pre-mortem forces you to be specific. If a row makes you uncomfortable, that row is your eval roadmap.
Next 30 minutes: start the golden set. Twenty real inputs from your actual product, each with the verified correct output and one sentence on why it is correct. Pull ten from normal traffic and ten from the failure ledger’s nightmares. Do not build infrastructure. A spreadsheet is fine. The asset is the judgment, not the format.
Last 15 minutes: put one number on the wall. Run your system against the twenty cases and write down the pass rate. That is now the number you watch. The grown-up version is cost per correct task: tokens are an input price, correctness is the output price. Re-run on every prompt change, every model swap, every Friday. The week it drops, you caught a regression for the price of a coffee. That is the entire eval budget conversation, version one. Everything else on the ladder is this loop with more zeros.
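The two numbers from this exercise are simple enough to compute by hand, but writing them down as functions keeps the definitions honest. A minimal sketch; the function names and the sample figures are hypothetical.

```python
def reliability_tax_ratio(invoice_spend, cleanup_hours, loaded_hourly_cost):
    """Hidden cleanup cost divided by the visible AI invoice for the same period."""
    return (cleanup_hours * loaded_hourly_cost) / invoice_spend

def cost_per_correct_task(total_token_cost, tasks_run, pass_rate):
    """Token spend divided by the number of tasks that were actually correct."""
    correct_tasks = tasks_run * pass_rate
    if correct_tasks == 0:
        raise ValueError("no correct tasks: cost per correct task is undefined")
    return total_token_cost / correct_tasks

# Hypothetical month: $2,000 invoice, 30 hours of cleanup at $100 loaded cost.
ratio = reliability_tax_ratio(invoice_spend=2000, cleanup_hours=30, loaded_hourly_cost=100)

# Hypothetical golden-set run: $4.00 of tokens, 20 cases, 80% pass rate.
unit_cost = cost_per_correct_task(total_token_cost=4.00, tasks_run=20, pass_rate=0.8)
```

In this made-up month the cleanup cost is 1.5 times the invoice, and each correct task costs $0.25 rather than the $0.20 per task the token bill alone would suggest.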
The founders I know who run this loop do not describe evals as a tax. They describe shipping the way you ship when a model deprecation notice is a calendar item instead of an emergency: change the config, run the suite, read the diff, done by lunch. That is what the budget buys. Not safety theater. Speed.
FAQ
How much should I budget for AI evals?
For customer-facing AI products at scale, the defensible range in 2026 is 25-35% of total AI project spend. Earlier stages need less: roughly 5% (mostly founder time) at prototype, 10-15% once real customers depend on outputs. The band moves with blast radius: halve it if humans review every output, raise it if your AI touches money, medicine, or law unreviewed.
What is the reliability tax?
The reliability tax is what an AI product pays for unmeasured failures: refunds, churn, legal exposure, engineering time lost to debugging silent regressions, and pilots that die before production. It is unbounded, lagging, and invisible in standard accounting because it never appears as a single line item. Evals convert it into a bounded, visible budget.
Are evals worth it for a solo founder or small startup?
Yes, but at solo scale: a 20-case golden set in a spreadsheet, run on every meaningful change, costs two evenings and catches the failure class that destroys early trust, the case that used to work and silently broke. Skip vendors and automation until wiring a CI gate yourself would take more than a week. Twenty cases run religiously beats two hundred run once.
What is the difference between observability and evals?
Observability records what your AI system did: traces, latency, token counts, tool calls. Evals measure whether what it did was correct. The industry runs an enormous gap between the two, 89% observability adoption versus 52% for evals, which means most teams can watch a hallucination happen in perfect detail without any system that flags it as wrong.
Should I buy an eval platform or build my own?
Buy the plumbing, never the judgment. Scoring infrastructure, dashboards, and labeling queues are commodities, and 20+ funded vendors are competing prices down. Your golden set (the cases, correct answers, and rubrics that define quality for your product) is your business logic as test data and must stay in your repo, versioned and exportable. Outsourcing the rubric makes your quality bar a vendor default.
When do evals actually pay for themselves?
Three common payback events: the first silent regression caught in CI instead of by a customer (saves an engineer-week of forensics), the first model deprecation survived in an afternoon because the suite scored the replacement, and the first enterprise deal where demonstrated accuracy justified premium pricing. Practitioner experience is consistent that every hour invested in evals saves dozens of hours of incident response.
Why do most AI pilots fail to reach production?
The dominant cited barrier is quality, named by 32% of teams in LangChain’s 1,300-respondent survey, ahead of cost and latency. Roughly 88% of proofs-of-concept never ship, and enterprise agentic systems average a 37% gap between benchmark and production performance. Pilots built without evals cannot prove they work, and what cannot prove it works gets cut.
Is the 25-35% eval budget number realistic for enterprises?
It comes from project-level analyses of customer-facing AI deployments in 2026, which also document a fixed floor: $40,000-$120,000 for eval-set construction, labeling setup, and measurement infrastructure regardless of project size. Enterprises spending $2 million on an AI project and zero on structured evaluation are not saving money; they are self-insuring an unbounded risk at a premium of exactly nothing.