MASTASCUSA HOLDINGS

April 24, 2026

Five Ways Your AI Deployment Will Quietly Fail Before Anyone Notices

The failure modes that don't make headlines until they do. A practitioner's taxonomy of silent drift, stale features, broken evaluation, unowned models, and metric-vs-reality gaps — grounded in the cases that made it to court.

[Figure: failure-surface timeline from Day 0 through Day 365+. Deployment is a single point; the failure surface compounds across five modes: 01 data drift (inputs no longer match training), 02 training-serving skew (features computed differently), 03 broken eval loops (hallucinations nobody catches), 04 unowned models (no one validates, no kill switch), 05 metric-vs-reality gap (optimized for the wrong KPI).]

Most AI failures don’t start with a dramatic outage. They start quietly — a model doing exactly what it was trained to do, on data that’s no longer the data it was trained on, producing outputs no one is checking, on systems no one owns.

The 2025 Stanford HAI AI Index Report counted 233 documented AI incidents in 2024 — a 56.4% year-over-year increase, the largest single-year rise the AI Incident Database has recorded.1 The headline of that data isn’t “AI is getting worse.” It’s “the surface area of deployment is getting wider faster than the discipline of operating it is scaling.”

Here are the five failure modes that cause the most production damage in my experience, and the cases that made each one concrete.

1. Silent data drift

A model in production sees a constant stream of new inputs. Those inputs are not drawn from the same distribution the model was trained on — they can’t be, because time passes and the world changes. Feature distributions shift (data drift); the relationship between features and labels shifts (concept drift).2 Both degrade model quality. Neither announces itself.

The canonical detection techniques — Kolmogorov-Smirnov tests, ADWIN, KSWIN, Population Stability Index — all require you to have baseline distributions documented and monitored.3 Without that scaffolding, the first signal you get that a model has drifted is a downstream business outcome: conversions dropping, fraud slipping through, approvals denied to the wrong people.

The ugly part: when ground-truth labels are delayed or unavailable, which is most of production, you can’t even measure accuracy live. You have to rely on input-distribution monitoring as a proxy. That’s a whole parallel system most teams haven’t built.
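
What that parallel system boils down to, at minimum, is comparing a logged training-time baseline against a window of recent serving inputs, feature by feature. A minimal sketch, assuming you have both samples as arrays; the thresholds below are common rules of thumb, not standards:

```python
# Minimal sketch of label-free drift monitoring for one numeric feature.
# Assumes a logged training-time baseline and a recent serving window;
# thresholds are common rules of thumb, not authoritative cutoffs.
import numpy as np
from scipy.stats import ks_2samp

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between baseline and current samples."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    base_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    curr_pct = np.histogram(current, bins=edges)[0] / len(current)
    base_pct = np.clip(base_pct, 1e-6, None)  # avoid log(0) on empty bins
    curr_pct = np.clip(curr_pct, 1e-6, None)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

def check_feature_drift(baseline: np.ndarray, current: np.ndarray) -> dict:
    ks_stat, p_value = ks_2samp(baseline, current)
    stability = psi(baseline, current)
    return {
        "ks_statistic": float(ks_stat),
        "ks_p_value": float(p_value),
        "psi": stability,
        # PSI above ~0.2 or a very small KS p-value is a review trigger.
        "needs_review": stability > 0.2 or p_value < 0.01,
    }
```

The point is not this particular pair of statistics. The point is that the baseline, the comparison, and the review trigger exist and are owned.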

The NIST AI RMF’s MEASURE function calls this out explicitly — “trustworthiness characteristics” are supposed to be monitored across the deployment lifecycle.4 The gap is not in the framework; it’s in the implementation.

Audit question: For every production model, can the team produce the baseline distribution, the current distribution, the drift metric in use, and the last time that metric was reviewed?

2. Stale features and training-serving skew

Feature stores and feature pipelines are simultaneously among the highest-leverage and highest-risk components in any ML system. A feature computed one way during training and another way during serving will produce predictions that look reasonable and are wrong.

This is not an exotic edge case. It happens when:

  • A training script uses a full-history aggregation; the serving path uses a 7-day window.
  • A feature is filled with different default values in the two paths.
  • Time-zone conversions are applied once during batch training and twice during streaming inference.
  • A data-quality filter upstream is added or removed without redeploying downstream models.

NIST AI RMF’s MAP function requires documenting deployment context.4 In practice, this means writing down what goes into the model, where each feature comes from, and how it’s computed in both environments. If you can’t hand a new engineer that document in under five minutes, you have training-serving skew risk whether you’ve observed it yet or not.

Audit question: Pick a single production feature at random. Can the team show you the training-time computation, the serving-time computation, and a test that asserts they produce identical output on the same input?
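
A minimal sketch of what that test can look like, assuming both code paths expose the feature computation as an importable function. The module names and the feature itself are hypothetical stand-ins for whatever your paths actually expose:

```python
# Minimal sketch of a training/serving parity test. The modules
# training_features and serving_features, and the rolling_spend_7d feature,
# are hypothetical stand-ins for your real code paths.
import pandas as pd

from training_features import rolling_spend_7d as train_rolling_spend_7d
from serving_features import rolling_spend_7d as serve_rolling_spend_7d

def test_rolling_spend_7d_parity():
    # One fixed, representative input both paths must agree on, including
    # the edge cases that cause skew (timezones, defaults, window bounds).
    events = pd.DataFrame({
        "user_id": [1, 1, 1, 2],
        "amount": [10.0, 25.0, 5.0, 40.0],
        "ts": pd.to_datetime(
            ["2026-01-01", "2026-01-05", "2026-01-20", "2026-01-03"], utc=True
        ),
    })
    as_of = pd.Timestamp("2026-01-21", tz="UTC")

    trained = train_rolling_spend_7d(events, as_of=as_of)
    served = serve_rolling_spend_7d(events, as_of=as_of)

    # Identical output on identical input, down to dtype and column order.
    pd.testing.assert_frame_equal(
        trained.sort_index(axis=1),
        served.sort_index(axis=1),
    )
```

If the two computations live in different languages or services, the same test still applies; it simply runs as an integration test against both endpoints rather than two imports.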

3. Broken evaluation loops — hallucinations nobody catches

In February 2024, the BC Civil Resolution Tribunal ruled against Air Canada in Moffatt v. Air Canada. The airline’s customer-service chatbot had told Jake Moffatt, a grieving customer, that he could apply for a bereavement-fare refund retroactively within 90 days of booking. The actual policy was the opposite: bereavement fares could not be claimed retroactively after travel.5

Moffatt booked. Air Canada refused the refund. The tribunal found Air Canada liable for negligent misrepresentation and ordered payment of $812.02.

The actual damages were small. The precedent is not. The tribunal’s operative finding:

“I find Air Canada did not take reasonable care to ensure its chatbot was accurate.”5

This is a failure of the evaluation loop. A customer-facing LLM was deployed without a production evaluation harness that would have caught the hallucination — or if such a harness existed, its outputs were not connected to anyone whose job was to act on them.
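
What a minimal version of that harness can look like, assuming a curated set of policy questions with required facts for each answer. The ask_chatbot callable, the dataset format, and the 0.95 threshold are illustrative assumptions, not any vendor's actual API:

```python
# Minimal sketch of a policy-grounded evaluation loop for a customer-facing
# chatbot. ask_chatbot, the dataset format, and the threshold are
# illustrative assumptions.
import json

def contains_required_facts(answer: str, required: list[str]) -> bool:
    """Crude containment check: every required policy phrase must appear."""
    return all(fact.lower() in answer.lower() for fact in required)

def run_policy_eval(ask_chatbot, dataset_path: str, threshold: float = 0.95) -> dict:
    with open(dataset_path) as f:
        # Expected shape: [{"question": ..., "required_facts": [...]}, ...]
        cases = json.load(f)

    failures = []
    for case in cases:
        answer = ask_chatbot(case["question"])
        if not contains_required_facts(answer, case["required_facts"]):
            failures.append({"question": case["question"], "answer": answer})

    pass_rate = 1.0 - len(failures) / len(cases)
    return {
        "pass_rate": pass_rate,
        "passed": pass_rate >= threshold,
        # Failures need a named human reviewer with authority to change the
        # system, not just a dashboard nobody reads.
        "failures": failures,
    }
```

The scoring here is deliberately crude; what matters is that the dataset, the threshold, and the person responsible for acting on failures all exist.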

The NIST AI 600-1 Generative AI Profile enumerates hallucinations as one of twelve risks “unique to or exacerbated by” generative AI.6 Hallucinations are a known failure mode. The failure in Air Canada’s case was not that an LLM hallucinated — every LLM hallucinates. The failure was that the organization had no mechanism to detect when it had and no person whose job was to respond when it did.

Audit question: For every customer-facing LLM surface, can the team produce the evaluation dataset, the pass/fail threshold, the most recent evaluation run, the person who reviewed it, and the change that was made in response to failures?

4. Unowned models and unowned code

The best-documented case in deployment-risk history is still Knight Capital, 2012. The failure modes are identical to the ones any modern AI team should be stress-tested against.

On August 1, 2012, Knight Capital lost over $460 million in approximately 45 minutes.7 The technical root cause was a manual deployment of new RLP (Retail Liquidity Program) code to eight production servers. The engineer missed one. That server still had a dormant test algorithm called Power Peg — code Knight had stopped using nearly a decade earlier and never removed. The new RLP code repurposed a flag that had previously activated Power Peg; on the seven updated servers, orders carrying that flag went to RLP as intended, but on the eighth they triggered Power Peg instead.

For roughly 45 minutes, Knight’s order routing system executed 4 million orders across 154 stocks for more than 397 million shares, leaving the firm with a net long position of about $3.5 billion and a net short position of about $3.15 billion before the runaway code could be stopped.7

The failure modes:

  • Unowned legacy code. Power Peg had not been used in nine years. No one owned its existence or had responsibility for its removal. This is the exact pattern that shows up in modern AI stacks as: an old model still hosted behind a load balancer, an old fine-tune still referenced by an experimental flag, an old embedding index still queryable by a service you forgot about.
  • Non-atomic deployment. Partial rollout plus a lingering legacy flag is a catastrophic-behavior pattern under load. “Deployments are atomic” is a claim most teams make and can rarely demonstrate.
  • No kill switch. Forty-five minutes to identify and stop a runaway system is not a technology failure. It’s a runbook failure. Someone needed to be able to respond the moment orders started pouring in that nobody had commanded, and no one could.

SR 11-7, the Federal Reserve’s 2011 guidance on model risk management, requires “effective challenge” — independent validation by people not involved in model development.8 Knight Capital happened the year after SR 11-7 was issued, at a financial institution, and the failure mode SR 11-7 was written to prevent is precisely what caused the loss.

Fifteen years later, most AI teams still do not have effective challenge. The model team validates its own model. That’s not validation. That’s reading your own homework.

Audit question: For every deployed model, who owns it? Who validates it independently? Who has authority to kill it? Can you stop a runaway model in under five minutes?
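
The last question is the one teams most often cannot answer. A minimal sketch of a circuit-breaker pattern that makes it answerable, assuming model traffic passes through a gateway you control; the model client, fallback, and thresholds are illustrative:

```python
# Minimal sketch of a kill switch / circuit breaker in front of a deployed
# model. model_client, fallback, and thresholds are illustrative assumptions;
# the point is that turning the model off is one flag flip executed in
# seconds, not a redeploy executed in 45 minutes.
import time

class ModelCircuitBreaker:
    def __init__(self, model_client, fallback, error_threshold=0.2, window_s=300):
        self.model_client = model_client   # live model endpoint (callable)
        self.fallback = fallback           # rule-based or cached behavior (callable)
        self.error_threshold = error_threshold
        self.window_s = window_s
        self.calls: list[float] = []       # timestamps of recent calls
        self.errors: list[float] = []      # timestamps of recent failures
        self.killed = False                # the manual kill switch

    def kill(self) -> None:
        """Anyone on the on-call rotation can flip this, and knows how."""
        self.killed = True

    def predict(self, features):
        now = time.time()
        self.calls = [t for t in self.calls if now - t < self.window_s]
        self.errors = [t for t in self.errors if now - t < self.window_s]
        error_rate = len(self.errors) / max(len(self.calls), 1)

        # Manual kill or automatic trip: route around the model entirely.
        if self.killed or error_rate > self.error_threshold:
            return self.fallback(features)

        self.calls.append(now)
        try:
            return self.model_client(features)
        except Exception:
            self.errors.append(now)
            return self.fallback(features)
```

In production the killed flag would live in a shared flag store rather than process memory, so one command disables every replica at once; the in-memory version is only for illustration.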

5. Metric-vs-reality gaps

The Zillow Offers shutdown is the cleanest case study of this failure mode in the public record.

Zillow built the Zestimate — an automated home valuation model — over more than a decade, optimized to minimize statistical error against comparable sales prices. Then, starting in 2019, Zillow adapted that model into an iBuying pricing engine, using the model’s output as the offer price for homes Zillow itself would purchase.9

The model was built to minimize prediction error. The business needed it to calculate expected value including holding costs, rehab costs, liquidity risk, and market-shift risk. Those are different objectives.

To make it worse, Zillow bid above the model’s own predictions on cookie-cutter homes to win deals, severing the link between the model’s output and the actual transaction price. Zillow expanded to non-cookie-cutter homes in markets where the Zestimate had never been validated. When the housing market shifted in 2021, Zillow was left holding approximately 7,000 homes purchased at prices it could no longer recover. The company recorded $569 million in Q3 2021 inventory write-downs, laid off 25% of its workforce, and shut Zillow Offers down in November 2021.9

The model did not malfunction. It did exactly what it had been built to do. But the metric it optimized, prediction accuracy against comparable sales, was not the metric the business needed, which was expected profit under real-world market conditions. No one validated that alignment before scaling. When the discrepancy became visible, the loss was already booked.
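
The gap between the two objectives is arithmetic, not subtle. A minimal sketch with purely illustrative numbers, showing how an offer that looks fine under a prediction-error metric can be deeply negative under an expected-profit metric:

```python
# Purely illustrative numbers; the point is the objective, not the values.
predicted_sale_price = 400_000               # output of an accuracy-optimized model
offer_price = predicted_sale_price * 1.02    # bidding above the model to win deals

holding_costs = 12_000                       # taxes, financing, upkeep while listed
rehab_costs = 15_000
selling_costs = 0.06 * predicted_sale_price  # agent and closing costs
market_shift = -0.05                         # prices move 5% against you before resale

expected_resale = predicted_sale_price * (1 + market_shift)
expected_profit = (
    expected_resale - offer_price - holding_costs - rehab_costs - selling_costs
)

print(f"offer price:     {offer_price:>10,.0f}")      # 408,000
print(f"expected resale: {expected_resale:>10,.0f}")  # 380,000
print(f"expected profit: {expected_profit:>10,.0f}")  # -79,000
```

A model can sit well inside its advertised error band on the sale price and still lose money on every transaction once the costs its objective never saw are added back in.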

This is the failure mode that every AI initiative anchored to a “model accuracy” KPI is exposed to. Accuracy against training labels is not a business outcome. The job of the auditor is to ask, persistently and unkindly, whether they are the same thing. They almost never are.

Audit question: For every deployed model, what business metric is it actually optimized against? How is that connected — with evidence — to the downstream business outcome the organization cares about? When did someone last validate that the connection still holds?

Why these five

Drift, skew, evaluation gaps, ownership gaps, and metric-vs-reality gaps cover most production AI failures before you get to the adversarial surface. The adversarial surface — prompt injection, model extraction, data poisoning, supply-chain compromise, enumerated in MITRE ATLAS10 and the OWASP Top 10 for LLM Applications11 — adds another five or six categories of quiet failure, worth their own treatment.

What unites all of them is that they are not AI problems specifically. They are operations problems that happen to apply to AI. The Air Canada chatbot case and the Knight Capital trading case are twelve years apart and use different technology, and the failure modes are indistinguishable. Unowned code, non-atomic deployment, no kill switch, no independent validation, no connection between what the system optimizes and what the business needs — these are operations failures. AI just made the consequences faster and the decisions more opaque.

A readiness audit that doesn’t ask about these five things is not a readiness audit. It’s a sales deck with a cover sheet.


Sources

Footnotes

  1. Stanford Institute for Human-Centered AI, The 2025 AI Index Report, Responsible AI chapter (2024 incident data). https://hai.stanford.edu/ai-index/2025-ai-index-report/responsible-ai

  2. Wikipedia contributors, Concept drift. https://en.wikipedia.org/wiki/Concept_drift

  3. Evidently AI, What is data drift in ML, and how to detect and handle it. https://www.evidentlyai.com/ml-in-production/data-drift

  4. National Institute of Standards and Technology, Artificial Intelligence Risk Management Framework (AI RMF 1.0), NIST AI 100-1, January 26, 2023. https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf

  5. Moffatt v. Air Canada, 2024 BCCRT 149 (British Columbia Civil Resolution Tribunal, February 14, 2024); CBC News coverage, February 15, 2024. https://www.cbc.ca/news/canada/british-columbia/air-canada-chatbot-lawsuit-1.7116416

  6. National Institute of Standards and Technology, Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile, NIST AI 600-1, July 26, 2024. https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf

  7. U.S. Securities and Exchange Commission, In the Matter of Knight Capital Americas LLC, Administrative Proceeding File No. 3-15570, Release No. 34-70694, October 16, 2013. https://www.sec.gov/files/litigation/admin/2013/34-70694.pdf

  8. Board of Governors of the Federal Reserve System / Office of the Comptroller of the Currency, SR 11-7: Supervisory Guidance on Model Risk Management, April 4, 2011. https://www.federalreserve.gov/boarddocs/srletters/2011/sr1107.htm

  9. Stanford Graduate School of Business, Flip Flop: Why Zillow’s Algorithmic Home Buying Venture Imploded. https://www.gsb.stanford.edu/insights/flip-flop-why-zillows-algorithmic-home-buying-venture-imploded; AI Incident Database incident #149. https://incidentdatabase.ai/cite/149/

  10. MITRE Corporation, MITRE ATLAS — Adversarial Threat Landscape for AI Systems, version 5.1.0, November 2025. https://atlas.mitre.org/

  11. OWASP Gen AI Security Project, OWASP Top 10 for LLM Applications 2025. https://genai.owasp.org/llm-top-10/