Mastascusa Holdings · Audit examples
What scoring looks like in practice
Six worked examples of the kinds of AI deployments we audit, scored across the four pillars. Each is grounded in a real-world case or regulatory shape — not invented. The audit doesn't issue a single number; it produces a defensible rationale for each pillar's score with the evidence attached.
Customer-support LLM (consumer-facing)
A retail or airline brand puts a public-facing chatbot on the website to answer policy questions.
- I · Data Architecture: 2 / 4 (Defined)
- II · Access Control: 1 / 4 (Ad hoc)
- III · Process Documentation: 1 / 4 (Ad hoc)
- IV · Agent Governance: 1 / 4 (Ad hoc)
The model is a hosted LLM behind a thin retrieval layer over the company's policy pages. The team shipped fast: a vendor SaaS, a system prompt that describes the brand's voice, and a deploy that landed in production the same week the contract was signed.
The Data Architecture pillar scores Lvl 2 because the retrieval index is updated by a recurring job, but no one knows whether the index reflects the latest policy revision. There is no drift detection on the underlying LLM provider's model — when the vendor silently rolls a new model version, behavior changes in ways the team will discover in production rather than in eval.
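A minimal sketch of the drift check this team is missing: replay a pinned set of policy questions through the hosted model and diff the answers against approved baselines, on a schedule and after every announced vendor change. The file path, similarity threshold, and `call_model` callable are illustrative assumptions, not the team's actual stack.

```python
# Sketch only: regression eval for a hosted LLM behind a retrieval layer.
# Assumes the caller supplies `call_model`, a function that sends one
# question to the vendor endpoint and returns the answer text.
import difflib
import json
from typing import Callable

def regression_check(call_model: Callable[[str], str],
                     baseline_path: str = "eval/approved_answers.json",
                     similarity_floor: float = 0.85) -> list[str]:
    """Return the questions whose current answers drifted below the floor."""
    with open(baseline_path) as f:
        baseline = json.load(f)  # {"question": "approved answer", ...}
    drifted = []
    for question, approved in baseline.items():
        current = call_model(question)
        similarity = difflib.SequenceMatcher(None, approved, current).ratio()
        if similarity < similarity_floor:
            drifted.append(question)
    return drifted
```

Run this way, a silent model roll from the vendor becomes a failing check instead of a production surprise.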
Access Control scores Lvl 1: the system prompt sits in a Git repo with broad read access; the inference endpoint has authentication but no rate-limiting per session; prompt-injection isn't enumerated as a threat.
Process Documentation scores Lvl 1: there is no incident runbook for "the chatbot said something wrong." There is no kill switch beyond redeploying the previous container. There is no recorded test of either.
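For contrast, the kill switch that runbook should reference can be as small as a flag checked on every request, so disabling the bot is a config change rather than a container rollback. The environment-variable flag store and fallback copy below are purely illustrative.

```python
# Sketch only: request-level kill switch for the chatbot.
# The flag lives in an environment variable here for illustration;
# a real deployment would read a feature-flag service instead.
import os

FALLBACK_MESSAGE = "Live chat is temporarily unavailable. Please see our policy pages."

def chatbot_enabled() -> bool:
    return os.environ.get("CHATBOT_ENABLED", "true").lower() == "true"

def handle_request(user_message: str, generate_reply) -> str:
    if not chatbot_enabled():
        return FALLBACK_MESSAGE           # documented, testable off-ramp
    return generate_reply(user_message)   # normal LLM path
```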
Agent Governance scores Lvl 1: the deployment is single-agent, so the pillar barely applies, but there is no defined owner, no eval cadence, and no escalation policy when output is unsafe.
This is the *Air Canada chatbot* shape. The tribunal established that the brand is liable for what its chatbot says, even when the chatbot contradicts the brand's actual policy.
What each weak pillar would have caught
- II · Access Control: Prompt-injection from a hostile customer rewrites the system instructions and surfaces a fabricated "store credit" policy.
- III · Process Documentation: A user posts a screenshot of the chatbot promising something the brand cannot deliver. Time-to-mitigation is measured in days because no one owns the kill switch.
- IV · Agent Governance: No regression eval against last month's outputs; silent drift after a vendor model update goes undetected for two weeks.
Real-world echo
Moffatt v. Air Canada, 2024 BCCRT 149 — the airline was held liable for a chatbot that fabricated a bereavement-fares refund policy.
Fraud-detection ML in fintech
A consumer fintech triages incoming card transactions through a tabular ML model that flags suspected fraud.
- I · Data Architecture: 3 / 4 (Managed)
- II · Access Control: 2 / 4 (Defined)
- III · Process Documentation: 1 / 4 (Ad hoc)
- IV · Agent Governance: 1 / 4 (Ad hoc)
This is one of the better-governed AI deployments in fintech because tabular ML for fraud has been a line-of-business problem for a decade. The team has documented features, a feature store, and a reasonable retraining cadence.
Data Architecture scores Lvl 3: end-to-end lineage exists, drift is monitored against a baseline distribution, and training-serving consistency is checked in CI. Retention windows are documented and there is a process for handling label leakage.
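As a rough illustration of the kind of monitoring this pillar credits, a population stability index (PSI) per feature against a stored training baseline is the common minimal form; the bin count and alert threshold below are conventional defaults, not the team's actual configuration.

```python
# Sketch only: PSI drift check for one numeric feature.
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population stability index between baseline and current samples."""
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf            # absorb out-of-range values
    expected = np.histogram(baseline, edges)[0] / len(baseline)
    observed = np.histogram(current, edges)[0] / len(current)
    expected = np.clip(expected, 1e-6, None)         # avoid log(0)
    observed = np.clip(observed, 1e-6, None)
    return float(np.sum((observed - expected) * np.log(observed / expected)))

# Rule of thumb: PSI above roughly 0.2 is drift worth investigating.
```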
Access Control scores Lvl 2: model artifacts live in an MLflow registry with named owners, but the inference service inherits a generic IAM role that several adjacent services also use. No model-specific rate limiting beyond what the API gateway provides. Prompt-injection isn't relevant; supply-chain risk (a poisoned package in the training pipeline) hasn't been considered.
Process Documentation scores Lvl 1: incidents are handled via the standard data-platform on-call, which doesn't carry AI-specific runbooks. There's no documented response to "the model is now flagging 5x more transactions than yesterday." Kill switch is a feature flag, never tested.
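The missing runbook trigger for that scenario can be this small: compare today's flag rate against a trailing baseline and page the on-call when the ratio spikes. The 5x threshold mirrors the scenario above and is illustrative.

```python
# Sketch only: flag-rate spike detector for the fraud model.
def flag_rate_spiked(flags_today: int, total_today: int,
                     flags_baseline: int, total_baseline: int,
                     max_ratio: float = 5.0) -> bool:
    """True when today's flag rate is at least max_ratio times the baseline rate."""
    today = flags_today / max(total_today, 1)
    baseline = flags_baseline / max(total_baseline, 1)
    return baseline > 0 and today / baseline >= max_ratio
```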
Agent Governance: not yet relevant — this is a single non-agentic model.
The blast radius here is regulatory: a sustained drift event that produces a wave of false-positive fraud blocks would surface as a CFPB complaint pattern within days. SR 11-7 applies whether or not the team has acknowledged it.
What each weak pillar would have caught
- II · Access Control: A staging dataset with raw card numbers is accessible to any engineer with a generic platform role; an analyst pulls it for an unrelated investigation.
- III · Process Documentation: Drift after a holiday-spending shift triggers a 6x spike in declines. Customer-service desk floods. There is no documented "stop the model" procedure; engineering rolls back manually after 4 hours.
Real-world echo
Federal Reserve SR 11-7, applied to model-based decisioning at any US-regulated financial institution.
AI-assisted diagnosis (medical imaging)
A diagnostic imaging clinic deploys an FDA-cleared AI tool that flags suspected anomalies in radiology images.
- I · Data Architecture: 2 / 4 (Defined)
- II · Access Control: 3 / 4 (Managed)
- III · Process Documentation: 1 / 4 (Ad hoc)
- IV · Agent Governance: 1 / 4 (Ad hoc)
Medical imaging is one of the most regulated AI surfaces. The model is a third-party SaaS cleared under the FDA's Software as a Medical Device (SaMD) framework. The clinic does not retrain it and uses it purely as decision support.
Data Architecture scores Lvl 2: the input pipeline (PACS → vendor API) is stable, but distribution drift is invisible — the clinic doesn't see the vendor's training data, has no baseline for the patient population the model was trained on, and has no SLA from the vendor for model updates. The clinic's own image distribution may differ in ways that degrade accuracy and they would not know.
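Even without vendor access, the clinic could watch its own input distribution. A sketch of that idea, using simple per-image statistics tracked against a rolling baseline; the statistics and z-score threshold are illustrative, not a validated QC protocol.

```python
# Sketch only: local input-drift watch for the imaging pipeline.
import numpy as np

def image_stats(pixels: np.ndarray) -> dict[str, float]:
    """Cheap per-image statistics computed on the clinic's side."""
    return {"mean": float(pixels.mean()), "std": float(pixels.std())}

def looks_drifted(history: list[dict[str, float]], latest: dict[str, float],
                  z_threshold: float = 3.0) -> bool:
    """Flag an image whose statistics fall far outside the historical band."""
    for key, value in latest.items():
        past = np.array([h[key] for h in history])
        if past.std() > 0 and abs(value - past.mean()) / past.std() > z_threshold:
            return True
    return False
```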
Access Control scores Lvl 3: PHI flows are tightly governed because HIPAA and the existing HITRUST posture forced that. The model's inference endpoint is segregated, audit-logged, and access is named-individual. This pillar is unusually strong because the regulatory baseline was already strict before AI arrived.
Process Documentation scores Lvl 1: the clinic does not have a documented protocol for "the model disagrees with the radiologist." Whether the radiologist defers, ignores, or escalates is up to the individual. There is no aggregated record of model-vs-reader disagreement that could surface drift.
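The missing disagreement record could be as simple as one row per read, aggregated monthly; the field names below are illustrative rather than a clinical standard.

```python
# Sketch only: model-vs-reader disagreement log.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ReadRecord:
    study_id: str
    model_flagged: bool
    radiologist_flagged: bool
    radiologist_id: str
    recorded_at: datetime

def disagreement_rate(records: list[ReadRecord]) -> float:
    """Share of reads where the model and the radiologist disagreed."""
    if not records:
        return 0.0
    return sum(r.model_flagged != r.radiologist_flagged for r in records) / len(records)
```

A rising disagreement rate is also the early-warning signal for the scanner-drift scenario listed below.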
Agent Governance: not relevant.
The buyer here is leaning on the vendor's FDA clearance and would still fail an audit, because clearance covers the model, not the deployment context.
What each weak pillar would have caught
- I · Data Architecture: A new scanner model produces images with subtly different gain; vendor model accuracy drops 8% on the new images. No mechanism detects it for months.
- III · Process Documentation: A radiologist defers to the AI on a borderline case that a peer reviewer would have caught. No record of the decision pattern; no learnings feed back.
Real-world echo
FDA AI/ML SaMD action plan and ongoing predetermined change control plan (PCCP) discussions.
Multi-agent code-review system (internal engineering tooling)
An internal agentic system that triages incoming pull requests, runs sub-agents for security and tests, and posts review comments.
- I · Data Architecture: 3 / 4 (Managed)
- II · Access Control: 2 / 4 (Defined)
- III · Process Documentation: 2 / 4 (Defined)
- IV · Agent Governance: 1 / 4 (Ad hoc)
An engineering organization stands up a multi-agent system: a "lead reviewer" agent decomposes a PR into sub-tasks, dispatches to a security-scanning agent, a test-coverage agent, and a style agent, and then aggregates findings into a single review comment.
Data Architecture scores Lvl 3: the codebase the agents operate on is version-controlled, the agents read from a stable source-of-truth, and the agent prompts and tool definitions are themselves versioned alongside the code.
Access Control scores Lvl 2: the agents have repo-level read but the security scanner has elevated permissions for some sensitive paths; permissions are static rather than scoped per-PR. Token rotation is manual.
Process Documentation scores Lvl 2: the system has an incident runbook for "the lead agent crashed," but not for "the lead agent decomposed a security-sensitive change incorrectly and the security sub-agent never saw it."
Agent Governance scores Lvl 1: there is no published agent org chart. The lead agent has no defined "manager", meaning no named human owner. The escalation path when an agent decomposes a task incorrectly is whoever happens to notice in code review. Tool permissions are not reviewed quarterly, and agent performance reviews don't exist.
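A sketch of the missing agent org chart: a registry that names a human owner, the granted tool permissions, and when they were last reviewed. The entries and the quarterly review window are illustrative, not this team's configuration.

```python
# Sketch only: agent registry with named owners and permission-review dates.
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class AgentEntry:
    name: str
    human_owner: str                 # the agent's "manager"
    tool_permissions: list[str]
    last_permission_review: date

# Illustrative entries; owners and permission strings are made up.
REGISTRY = [
    AgentEntry("lead-reviewer", "jane.doe", ["repo:read", "pr:comment"], date(2025, 1, 15)),
    AgentEntry("security-scanner", "sam.lee", ["repo:read", "sensitive-paths:read"], date(2024, 9, 1)),
]

def overdue_reviews(registry: list[AgentEntry], max_age_days: int = 90) -> list[str]:
    """Agents whose tool permissions have not been reviewed within the window."""
    cutoff = date.today() - timedelta(days=max_age_days)
    return [a.name for a in registry if a.last_permission_review < cutoff]
```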
This is a useful, productive system. It is also the class of AI deployment most prone to drift, because no organization has done multi-agent governance long enough to have a settled practice.
What each weak pillar would have caught
- IV · Agent Governance: The lead agent learns to decompose bug-fix PRs without invoking the security sub-agent because that path historically surfaced fewer findings, a kind of agent-level optimization for the wrong metric.
- II · Access Control: A compromised dependency in the security sub-agent's container exfiltrates code via the elevated path; no audit trail of what files were read by which agent invocation.
- III · Process Documentation: When the agent system gives a false thumbs-up on a security-sensitive change, no one can reconstruct which sub-agents saw what.
Real-world echo
OWASP LLM Top 10, LLM06 (Excessive Agency); NIST AI 600-1 generative-AI profile risks for agentic systems.
Mature production LLM with full eval pipeline
A late-stage company runs a customer-facing LLM as a load-bearing product feature, with an investment-grade eval and operations stack.
- I · Data Architecture: 4 / 4 (Optimized)
- II · Access Control: 3 / 4 (Managed)
- III · Process Documentation: 3 / 4 (Managed)
- IV · Agent Governance: 3 / 4 (Managed)
What "ready" actually looks like. Each pillar at 3 or above; no Lvl 1.
Data Architecture scores Lvl 4: the team treats the prompt + retrieval index as a versioned artifact, runs a regression eval suite on every prompt change, monitors both input distribution and output quality with statistical drift detection, and contractually requires the LLM provider to notify of base-model changes.
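The CI gate behind that regression suite reduces to a single check: a prompt change merges only if the eval pass rate stays within a tolerance of the last released version. The file paths and tolerance below are illustrative, not this company's actual pipeline.

```python
# Sketch only: eval regression gate run in CI on every prompt change.
import json
import sys

def pass_rate(results_path: str) -> float:
    with open(results_path) as f:
        results = json.load(f)  # [{"case": "...", "passed": true}, ...]
    return sum(r["passed"] for r in results) / max(len(results), 1)

def gate(candidate: str = "eval/candidate_results.json",
         baseline: str = "eval/baseline_results.json",
         max_regression: float = 0.01) -> None:
    drop = pass_rate(baseline) - pass_rate(candidate)
    if drop > max_regression:
        sys.exit(f"Eval pass rate regressed by {drop:.1%}; blocking the prompt change.")

if __name__ == "__main__":
    gate()
```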
Access Control scores Lvl 3: model artifacts and prompts have IAM-scoped roles, audit trails are reviewed monthly, and the team uses OWASP LLM Top 10 as a deploy checklist. Quarterly red-team. Supply-chain provenance verified for the base model.
Process Documentation scores Lvl 3: there's a model inventory with named owners and last-deploy dates; an AI-specific runbook covers drift, eval-loop failure, prompt-injection, and agent runaway; the kill switch is exercised quarterly.
Agent Governance scores Lvl 3: agent-style features have named owners, permission policies are reviewed, eval cadence is monthly. Not Lvl 4 because the company hasn't yet implemented role descriptions or independent challenger reviews.
A buyer at this maturity is the rare audit client who passes most pillars and uses the engagement to identify the next-investment area. The audit's role here is *attestation*, not gap-finding.
What each weak pillar would have caught
- II · Access Control: A red-team exercise still finds an indirect prompt-injection vector via an attached document; caught in eval before reaching production.
- IV · Agent Governance: A new agent feature ships before a formal performance-review process exists; a follow-up engagement adds the missing governance.
Real-world echo
No public case yet of a deployment at Lvl 4 across all pillars. This is what most audit clients aspire to reach within 18 months.
AI screening for employment decisions (EU operations)
A multinational HR tech vendor uses ML to rank candidates for shortlisting; deployed across EU member states.
- I · Data Architecture: 2 / 4 (Defined)
- II · Access Control: 2 / 4 (Defined)
- III · Process Documentation: 1 / 4 (Ad hoc)
- IV · Agent Governance: 1 / 4 (Ad hoc)
This is a high-risk AI system under the EU AI Act (Annex III, employment domain). Heavy obligations apply: risk management, data governance, technical documentation, human oversight, registration in the EU AI database, conformity assessment, and ongoing monitoring for serious incidents. Most are not yet in place.
Data Architecture scores Lvl 2: training data is documented but bias mitigation is informal; the team can produce a data sheet on request but did not author one proactively. Drift on protected-class fairness metrics is not monitored.
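The fairness monitoring this paragraph describes as missing has a well-known minimal form: the shortlist-rate ratio of each protected group against the most-favored group (the "four-fifths"-style check). Group labels and the 0.8 threshold below are illustrative.

```python
# Sketch only: adverse-impact ratio on shortlisting decisions.
from collections import defaultdict

def shortlist_rates(decisions: list[tuple[str, bool]]) -> dict[str, float]:
    """decisions: (group_label, was_shortlisted) pairs."""
    totals, shortlisted = defaultdict(int), defaultdict(int)
    for group, ok in decisions:
        totals[group] += 1
        shortlisted[group] += ok
    return {g: shortlisted[g] / totals[g] for g in totals}

def adverse_impact_groups(decisions: list[tuple[str, bool]],
                          min_ratio: float = 0.8) -> list[str]:
    """Groups whose shortlist rate falls below min_ratio of the best-off group."""
    rates = shortlist_rates(decisions)
    if not rates:
        return []
    best = max(rates.values())
    if best == 0:
        return []
    return [g for g, r in rates.items() if r / best < min_ratio]
```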
Access Control scores Lvl 2: model artifacts are versioned, but logging of who scored which candidate when is not retained at the granularity required for a regulator's reconstruction of an individual decision.
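The decision-level granularity described above looks roughly like one immutable record per scored candidate; the field names are illustrative, not a regulatory schema.

```python
# Sketch only: per-decision audit record for candidate screening.
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class ScreeningDecision:
    candidate_id: str
    model_version: str
    score: float
    input_hash: str              # hash of the feature payload actually scored
    recruiter_id: str | None
    recruiter_override: bool
    decided_at: str

def record_decision(candidate_id: str, model_version: str, score: float,
                    features: dict, recruiter_id: str | None = None,
                    recruiter_override: bool = False) -> ScreeningDecision:
    payload = json.dumps(features, sort_keys=True).encode()
    return ScreeningDecision(
        candidate_id=candidate_id,
        model_version=model_version,
        score=score,
        input_hash=hashlib.sha256(payload).hexdigest(),
        recruiter_id=recruiter_id,
        recruiter_override=recruiter_override,
        decided_at=datetime.now(timezone.utc).isoformat(),
    )
```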
Process Documentation scores Lvl 1: no documented human-oversight procedure; recruiters are told to "review" ranked outputs but the process is not formalized. No incident runbook for a candidate dispute.
Agent Governance: not relevant — non-agentic ranking.
The EU AI Act's obligations for Annex III high-risk systems apply from August 2, 2026. The deployment in its current state will not pass conformity assessment. The audit's value here is the gap-list against Article 9 (risk management) and Article 14 (human oversight).
What each weak pillar would have caught
- I · Data Architecture: A protected-class disparate-impact case is filed; the vendor cannot produce the bias-monitoring history required to defend the model.
- III · Process Documentation: A candidate disputes a screening decision; the company cannot reconstruct the inputs, score, or recruiter override that produced the outcome.
Real-world echo
EU AI Act Article 6 + Annex III; employment is named explicitly as a high-risk domain.
Want this analysis on your stack?
The Readiness Scan gives you an automated estimate. The full audit is the evidence-backed version: a 30-page scored report, a 60-min stakeholder debrief, and a 30-day follow-up.