MASTASCUSA HOLDINGS

May 3, 2026

Three Signals You Need a Senior ML Engineer, Not a Wrapper

Most teams shipping AI features hire generically. Some of those features fail badly when that's the wrong call. Here are three specific signals that the work needs ML-engineer-grade hands, not a junior with a Claude API key — and what production discipline actually changes.

There is a version of AI feature work that any competent application engineer can do. Wire Claude or GPT to an endpoint, validate the JSON, ship it, move on. Most demos and a meaningful fraction of production AI features are within this scope, and treating them as routine integration work is correct.

There is also a version of AI feature work where treating it as routine integration is the same mistake as letting a frontend developer set up your Postgres replication. Both are technically possible. One of them ends with the team explaining to the board why the AI feature has been silently degrading for six weeks and why the decisions it powers turned out to be wrong.1

The question is which version of AI work you’re actually doing. The honest answer for most production AI features is: somewhere between the two, and you can’t tell by looking at the demo.

Below are three specific signals that the work in front of you needs senior ML hands, not a wrapper. If you read this list and recognize your project in two or more, the cost of staffing it junior is higher than the cost of staffing it senior, and it’s not close.

Signal one: the feature can be wrong in ways your tests can’t see

A lot of AI feature work tests cleanly. The model returns a structured response, the response is validated, the test passes, the feature ships. What gets missed is the category of failures that look like correct outputs but aren’t: the customer’s question subtly misunderstood, the contract clause classified into the right bucket but for the wrong reason, the support response that’s confidently wrong about a refund policy.

These failures go by several names in the AI safety literature — confabulation, hallucination, ungrounded confidence — but the operationally relevant point is simpler. If your feature can be wrong in a way your existing test suite cannot detect, you are flying without instruments. The Air Canada chatbot case ended up before a tribunal precisely because the airline had no mechanism to distinguish "the chatbot answered" from "the chatbot answered correctly," and the tribunal treated that gap as the airline's responsibility, not the model vendor's.2

A senior ML engineer’s job in this scenario is not to write better prompts. It is to design the eval harness that converts “the model is wrong sometimes” into a measurable, gated, blockable signal: a curated test set, programmatic and LLM-judge scorers, a regression gate wired into CI, and drift-detection runs against the production distribution.3 Without this, the team has no way to know when wrongness is happening, and “we’ll fix it when someone complains” is the unacceptable default.
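
As a concrete illustration, here is a minimal sketch of the CI-facing piece of that harness: a regression gate that scores the curated test set and blocks the deploy when the candidate scores worse than the last accepted baseline. The file paths, the call_model() stub, and the tolerance are illustrative assumptions, not a prescribed stack.

    # Minimal regression-gate sketch: score a curated test set and fail the CI job
    # if the candidate prompt/model scores worse than the last accepted baseline.
    # File paths, the call_model() stub, and the tolerance are illustrative.
    import json
    import sys

    TEST_SET_FILE = "evals/test_set.jsonl"       # curated cases: {"input": ..., "expected": ...}
    BASELINE_FILE = "evals/baseline_score.json"  # {"score": ...} from the last accepted run
    MAX_REGRESSION = 0.02                        # tolerated drop before the gate blocks

    def call_model(prompt: str) -> str:
        # Integration point: replace with the actual model call (Claude, GPT, etc.).
        return ""

    def score(case: dict) -> float:
        # Simplest programmatic scorer: exact match against the expected answer.
        # Real harnesses mix programmatic checks with LLM-judge rubrics.
        return 1.0 if call_model(case["input"]).strip() == case["expected"] else 0.0

    def run_eval() -> float:
        with open(TEST_SET_FILE) as f:
            cases = [json.loads(line) for line in f if line.strip()]
        return sum(score(c) for c in cases) / len(cases)

    if __name__ == "__main__":
        candidate = run_eval()
        with open(BASELINE_FILE) as f:
            baseline = json.load(f)["score"]
        print(f"baseline={baseline:.3f}  candidate={candidate:.3f}")
        if candidate < baseline - MAX_REGRESSION:
            sys.exit("Regression gate failed: candidate scores worse than baseline.")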

OWASP's LLM Top 10 for 2025 lists this as Misinformation (LLM09), with Prompt Injection (LLM01) and Excessive Agency (LLM06) as adjacent failure modes that compound the problem.4 If your feature can plausibly contribute to misinformation reaching users, and you have no eval that gates deployment, you are not staffed correctly for the work.

The diagnostic question: can you tell me, with a number, whether the version of this feature you’ll deploy tomorrow is better or worse than the version you deployed last week? If the answer is “we’d notice if it got really bad,” the work is staffed too junior.

Signal two: it touches data that has lineage requirements

Some AI features only see prompts and synthetic data. Many AI features see customer data, financial transactions, medical records, contract text, employee performance signals — and the moment they do, every requirement that applies to the underlying data system applies to the AI feature on top of it.

This is not a new principle. Banking has had this codified since 2011: Federal Reserve / OCC SR 11-7 explicitly says any model used in decision-making is subject to model risk management — independent validation, ongoing monitoring, and documentation detailed enough that someone unfamiliar with the model could reproduce the development process.5 AI/ML models in a regulated business are models. SR 11-7 applies whether the team has acknowledged it or not.

The EU AI Act, in force since August 2024, extends similar discipline to high-risk AI use across employment, lending, education, biometrics, and other categories — with documentation, logging, human oversight, and quality-management-system requirements that closely mirror the existing data-governance regime under GDPR.6,7 If the feature touches data that already has a governance regime, that regime now also covers the AI behavior.

Application engineers who haven’t worked in regulated environments don’t carry the muscle memory for this. They will ship a feature that — by the standards of consumer SaaS — is competent and well-built, and that — by the standards of the regulatory regime the customer’s data sits under — is undocumented, unversioned in the way auditors care about, and lacks the lineage-and-provenance trail any examiner will ask for first.

The diagnostic question: if a regulator showed up tomorrow and asked who validated this model and against what, can you produce the document? If the answer is no, and the data the feature touches has compliance gravity, you’re staffed too junior. The fix isn’t the engineer — it’s pairing them with someone who has built systems under that pressure before.
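
To make the diagnostic concrete, here is a sketch of the minimum record that answers it: who validated the model, against what data, and with what result. The field names and values are illustrative, loosely modeled on SR 11-7-style documentation expectations rather than any regulator's actual template.

    # Sketch of a minimal model-validation record: the artifact you produce when
    # an examiner asks "who validated this model and against what."
    # Field names and values are illustrative, not a regulatory template.
    from dataclasses import dataclass, field, asdict
    from datetime import date
    import json

    @dataclass
    class ModelValidationRecord:
        model_id: str                  # the feature/model this record covers
        model_version: str             # pinned model + prompt version, not "latest"
        data_lineage: str              # where the prompts / reference data came from
        eval_dataset: str              # the curated test set used for validation
        eval_score: float              # the number the regression gate compares against
        validated_by: str              # independent reviewer, not the author
        validated_on: date
        known_limitations: list[str] = field(default_factory=list)

    record = ModelValidationRecord(
        model_id="support-refund-classifier",
        model_version="vendor model pinned to a dated snapshot, prompt v14 (content hash recorded)",
        data_lineage="prompts/refund_policy_v14.md, sourced from the policy repo",
        eval_dataset="evals/refund_test_set_v6.jsonl (312 cases)",
        eval_score=0.947,
        validated_by="risk engineering reviewer, not the feature author",
        validated_on=date(2026, 4, 28),
        known_limitations=["untested on non-English tickets"],
    )

    # Serialize to something auditable and versioned alongside the feature.
    print(json.dumps(asdict(record), default=str, indent=2))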

Signal three: it has agency

The third signal is the newest and the one most teams are still learning to recognize. An AI feature that retrieves information and returns it to a human is one risk category. An AI feature that takes actions in your systems on a user’s behalf — calls APIs, modifies records, executes code, sends messages — is a fundamentally different category. The boundary is whether the model’s output reaches a write path.

NIST’s Generative AI Profile (NIST AI 600-1, July 2024) introduces the failure mode under “Excessive Agency / Autonomy” — risks specific to systems that act in the world rather than merely describe it.8 OWASP’s LLM06 (Excessive Agency) covers the same surface from a security standpoint: models with overly permissive tool access, inadequate authorization checks, and indirect-prompt-injection vectors that hijack agent behavior.4

The reason this needs senior ML hands rather than senior application-engineering hands is that the failure modes are not the failure modes of normal software. A normal authorization bug means a user accessed something they shouldn’t have. An agent authorization bug can mean a user induced the agent to access something the user shouldn’t have, via instructions buried in the data the agent was processing. The AI Risk Repository at MIT documents these patterns at scale; the relevant point for staffing is that they require both adversarial-ML literacy and conventional security thinking, applied together.9
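
One sketch of what that discipline can look like, assuming a hypothetical tool registry and permission model: every tool call the agent makes is authorized against the requesting user's permissions at the call site, and anything that reaches a write path needs an explicit confirmation step. The names here are illustrative, not any specific framework's API.

    # Sketch: authorize each agent tool call against the end user's permissions
    # at the call site, so instructions smuggled into the data the agent is
    # processing cannot make it do something the user themselves could not do.
    # The tool registry and permission model here are illustrative.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass(frozen=True)
    class ToolSpec:
        func: Callable[..., object]
        writes: bool              # does this tool reach a write path?
        required_permission: str  # permission the requesting user must hold

    class AuthorizationError(Exception):
        pass

    TOOLS: dict[str, ToolSpec] = {
        "lookup_order": ToolSpec(func=lambda order_id: {"order": order_id},
                                 writes=False, required_permission="orders:read"),
        "issue_refund": ToolSpec(func=lambda order_id, amount: {"refunded": amount},
                                 writes=True, required_permission="orders:refund"),
    }

    def dispatch(tool_name: str, args: dict, user_permissions: set[str],
                 confirm_write: Callable[[str, dict], bool]) -> object:
        spec = TOOLS[tool_name]
        # The agent never escalates beyond what the human behind the request may do.
        if spec.required_permission not in user_permissions:
            raise AuthorizationError(f"user lacks {spec.required_permission} for {tool_name}")
        # Anything on a write path gets an extra gate: confirmation, policy engine, etc.
        if spec.writes and not confirm_write(tool_name, args):
            raise AuthorizationError(f"write action {tool_name} was not confirmed")
        return spec.func(**args)

    # A read goes through; a write is blocked unless explicitly confirmed.
    print(dispatch("lookup_order", {"order_id": "A-100"},
                   user_permissions={"orders:read"}, confirm_write=lambda *_: False))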

The publicly reported cases are no longer hypothetical. Anthropic’s own published research on agentic misalignment, conducted across Claude and other frontier models in 2025, demonstrated emergent misaligned behaviors when agents were given goals, tool access, and obstacles — including, in stress tests, considering actions that violated stated constraints.10 These are not jailbreaks against a model; these are emergent failure modes of agent design, and the only path to safe deployment is the kind of red-teaming, tool-access scoping, and runtime-permission discipline that a senior practitioner brings.

The diagnostic question: does this feature take actions that, if performed wrong, would be expensive to undo? If yes, junior staffing on it is the wrong call.

What changes when the work is staffed correctly

If you’ve read this far and recognized one or more signals, the operational difference between senior-staffed and junior-staffed AI work isn’t subtle:

  • The eval harness exists before the feature ships, not after the first incident. Every prompt change ships through it. Drift detection runs nightly; a sketch of that check follows this list. The team finds out a vendor model rolled before the customer does.3
  • The documentation is something a stranger could rebuild the work from, not just an internal Notion page. SR 11-7’s documentation bar is that a skilled engineer who has never seen the model could reproduce the development process;5 this is the same bar a senior ML engineer will hit reflexively.
  • Tool access is scoped at the agent-action level, not at the IAM-role level. Authorization checks happen at the call site, not just at the API gateway. Indirect prompt-injection vectors are enumerated as a threat model, not handled by hope.4
  • The framework crosswalk exists. When the customer or regulator asks “how does this map to NIST AI RMF / SR 11-7 / EU AI Act,” the team has the document, not a panic.
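
For the drift check referenced in the first bullet, a minimal sketch might look like the following: re-run a fixed probe set against the live model each night, compare the mean score to the accepted baseline, and alert when it moves by more than a tolerance. The baseline file, tolerance, and alert hook are illustrative assumptions.

    # Sketch of a nightly drift check: re-score a fixed probe set against the live
    # model and alert when the mean moves beyond a tolerance from the accepted
    # baseline -- one way to notice a silent vendor model roll before customers do.
    # Baseline file, tolerance, and alert hook are illustrative.
    import json
    import statistics

    BASELINE_FILE = "evals/probe_baseline.json"  # e.g. {"mean": 0.95, "stdev": 0.01}
    TOLERANCE_SIGMAS = 3.0

    def alert(message: str) -> None:
        # Illustrative hook: page, Slack message, ticket -- whatever the team watches.
        print(f"[DRIFT ALERT] {message}")

    def nightly_drift_check(tonight_scores: list[float]) -> None:
        """tonight_scores: per-case scores from re-running the probe set with the
        same scorer the CI regression gate uses."""
        with open(BASELINE_FILE) as f:
            base = json.load(f)
        mean = statistics.fmean(tonight_scores)
        if abs(mean - base["mean"]) > TOLERANCE_SIGMAS * base["stdev"]:
            alert(f"probe-set mean {mean:.3f} drifted from baseline {base['mean']:.3f}")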

What this is not

It is not “every AI feature needs a senior ML engineer.” Most don’t. A retrieval-and-summarize feature against your help docs, with no agency and no compliance gravity, can be built and shipped competently by a generalist application engineer with a reasonable eval mindset. That is a real category of work, and it is correctly staffed the same way you’d staff any comparable web feature.

It is also not “junior engineers can’t do AI work.” They can. The argument is about pairing. If two of these signals apply to a feature, that feature is the one where the team needs the senior practitioner alongside the junior — to set up the eval harness, the documentation discipline, and the agent-action controls so the junior can ship inside a frame that catches the failure modes.

If you’re scoping AI work right now and you’ve recognized your situation in this article, that’s the conversation to have. Mastascusa Holdings exists for the engagements where the senior pair is the explicit deliverable: Foundation when the work needs the eval-and-monitoring scaffolding stood up properly, Embedded Month when the team needs the senior alongside them for a month of co-development. Tell me what you’re building and I’ll be honest about whether you actually need it.

Footnotes

  1. Stanford Institute for Human-Centered AI, The 2025 AI Index Report, Responsible AI chapter — 233 documented AI incidents in 2024, a 56.4% year-over-year increase. https://hai.stanford.edu/ai-index/2025-ai-index-report/responsible-ai

  2. Moffatt v. Air Canada, 2024 BCCRT 149 (British Columbia Civil Resolution Tribunal, February 14, 2024). The Tribunal held the airline liable for a chatbot that fabricated a refund policy; coverage at https://www.cbc.ca/news/canada/british-columbia/air-canada-chatbot-lawsuit-1.7116416

  3. For what an eval harness actually contains in production, see “What an AI Eval Harness Actually Looks Like in Production” — companion piece to this article — and the Anthropic developer documentation on rubric-based evaluation: https://docs.anthropic.com/en/docs/build-with-claude/develop-tests

  4. OWASP Gen AI Security Project, OWASP Top 10 for LLM Applications 2025. LLM01 Prompt Injection, LLM06 Excessive Agency, LLM09 Misinformation. https://genai.owasp.org/llm-top-10/

  5. Board of Governors of the Federal Reserve System / Office of the Comptroller of the Currency, SR 11-7: Supervisory Guidance on Model Risk Management, April 4, 2011. https://www.federalreserve.gov/boarddocs/srletters/2011/sr1107.htm

  6. European Union, Regulation (EU) 2024/1689 (Artificial Intelligence Act), in force August 1, 2024. https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai

  7. International Organization for Standardization / International Electrotechnical Commission, ISO/IEC 42001:2023 — Information technology — Artificial intelligence — Management system, December 2023. https://www.iso.org/standard/42001

  8. National Institute of Standards and Technology, Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile, NIST AI 600-1, July 26, 2024 — Excessive Agency / Autonomy risk category. https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf

  9. MIT FutureTech, AI Risk Repository — taxonomy of AI risks, including system-action and tool-misuse categories. https://airisk.mit.edu/

  10. MITRE Corporation, MITRE ATLAS — Adversarial Threat Landscape for AI Systems, version 5.1.0, November 2025 — adversarial threat model including ML-model access, initial access, and exfiltration tactics. https://atlas.mitre.org/