Mayo Clinic's deployment of diagnostic agents across its 23-hospital network entered a new phase in the fourth quarter of 2023, when the institution cleared its first formal evaluation checkpoint under a framework it internally designates the Clinical AI Reliability Standard — CARS, in the acronym-dense shorthand of clinical informatics. The rollout is not the story. The governance layer beneath it is. What Mayo built over 18 months of pre-deployment work — a structured eval framework, a tiered escalation architecture, a posture toward FDA oversight that is deliberately cooperative rather than evasive, and an audit trail deep enough to satisfy a malpractice discovery request — is becoming a reference model for health systems watching from the outside. The second-order effect begins now: every mid-tier academic medical centre with an AI vendor contract is reading the same deployment notes.
The deployment in numbers
The Clinic's implementation spans what its internal communications describe as 340 distinct clinical decision points — workflow nodes where a diagnostic agent is now available to propose, summarise, or flag. The figure covers emergency department triage, radiology report prefill, complex discharge planning, and a pilot programme in rare disease differential diagnosis that runs across three quaternary care sites: Rochester, Phoenix, and Jacksonville. The vendor infrastructure involves two primary partnerships: Nuance Communications, which supplies the ambient clinical intelligence layer, and a quieter arrangement with Highmark Health's digital-health subsidiary, Envolve, for the differential-diagnosis agent stack running on internally fine-tuned models.
Dr. Constance Aldridge, Mayo's Chief Medical Information Officer, oversaw the governance architecture. Aldridge spent the first year not building capability, but building the criteria by which capability would be permitted to operate. Her team's output — the CARS framework — defines four deployment tiers: Tier 0 covers pure summarisation with no clinical content; Tier 1 covers advisory outputs where a clinician must acknowledge before acting; Tier 2 covers outputs that modify the electronic health record with clinician co-signature; and Tier 3, still in restricted access, covers outputs that can initiate a standing-order request. Nothing in the current live deployment sits above Tier 2.
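The tier ladder reads naturally as a gate on what an output is allowed to do. The following is a minimal sketch in Python, not Mayo's implementation: the names `Tier`, `MAX_LIVE_TIER`, and `may_run` are hypothetical, and the sign-off rules are inferred only from the tier descriptions above.

```python
from enum import IntEnum


class Tier(IntEnum):
    """CARS deployment tiers as described in the text; member names are illustrative."""
    SUMMARISATION = 0   # pure summarisation, no clinical content
    ADVISORY = 1        # clinician must acknowledge before acting
    EHR_COSIGN = 2      # modifies the record with clinician co-signature
    STANDING_ORDER = 3  # can initiate a standing-order request (restricted access)


# Assumption: the live ceiling is Tier 2, since nothing live sits above it.
MAX_LIVE_TIER = Tier.EHR_COSIGN


def may_run(tier: Tier, acknowledged: bool = False, cosigned: bool = False) -> bool:
    """Return True if an output at `tier` may proceed in the live deployment."""
    if tier > MAX_LIVE_TIER:
        return False          # Tier 3 is not live
    if tier == Tier.ADVISORY:
        return acknowledged   # needs clinician acknowledgement
    if tier == Tier.EHR_COSIGN:
        return cosigned       # needs clinician co-signature
    return True               # Tier 0 needs no sign-off
```

Under these assumptions, `may_run(Tier.ADVISORY)` is refused until a clinician acknowledges, and `may_run(Tier.STANDING_ORDER, cosigned=True)` is refused regardless of sign-off.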
The network serviced by these agents represents roughly 1.4 million ambulatory visits per year. The agents handled approximately 680,000 documentation and advisory interactions in Q4 2023 alone. That throughput — not the technology itself — is what made the governance question urgent. At 680,000 interactions, edge cases are not hypothetical.
The CARS eval framework: what it tests and why
Clinical AI evaluation at Mayo runs through four distinct phases before any agent capability clears for deployment. Phase one is benchmark performance: agents are tested against a gold-standard case set of 2,400 de-identified patient records, curated specifically to include rare presentations, ambiguous imaging findings, and atypical symptom clusters that exposed brittleness in prior commercial systems. Phase two is counterfactual stress-testing. Aldridge's team calls this the subtraction battery — they remove single variables from a case (a lab value, a medication history, a chief complaint) and measure whether the agent's output degrades gracefully or catastrophically. Catastrophic degradation disqualifies a capability. Graceful degradation requires documentation and a user-facing caveat.
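The subtraction battery is easy to picture in code. The sketch below is an illustration under stated assumptions: `subtraction_battery`, `toy_agent`, and the `is_catastrophic` callback are hypothetical names, and how Mayo actually scores "graceful" versus "catastrophic" is not described in the source, so the caller supplies that judgement.

```python
from typing import Any, Callable, Dict

Case = Dict[str, Any]


def subtraction_battery(
    agent: Callable[[Case], str],
    case: Case,
    is_catastrophic: Callable[[str, str], bool],
) -> Dict[str, str]:
    """Remove one field at a time and classify how the agent's output changes.

    Returns a mapping from the removed field to "unchanged", "graceful",
    or "catastrophic".
    """
    baseline = agent(case)
    results: Dict[str, str] = {}
    for field in case:
        reduced = {k: v for k, v in case.items() if k != field}
        output = agent(reduced)
        if output == baseline:
            results[field] = "unchanged"
        elif is_catastrophic(baseline, output):
            results[field] = "catastrophic"
        else:
            results[field] = "graceful"
    return results


def toy_agent(case: Case) -> str:
    """Stand-in agent: declines to answer when the lactate value is missing."""
    if "lactate" not in case:
        return "insufficient data: lactate missing"
    return "flag sepsis" if case["lactate"] > 2.0 else "no flag"
```

In this toy setup, an agent that admits it lacks the lab value degrades gracefully, while one that silently answers "no flag" once the value disappears would be classed as catastrophic.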
Phase three is red-team chart review, conducted by a standing panel of eight senior clinicians drawn from internal medicine, oncology, emergency medicine, and radiology. The panel reviews 200 randomly sampled agent outputs per quarter against what the treating team actually documented. Agreement rates must exceed 87 per cent on primary recommendation and 94 per cent on escalation trigger — the point at which an agent should flag that a human decision is required urgently. Any quarter in which rates fall below these thresholds triggers a deployment pause for the affected capability until root cause analysis completes.
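The quarterly gate is simple arithmetic. A hedged sketch follows: `ReviewedOutput` and `quarterly_gate` are hypothetical names, and reading "must exceed" strictly (so that a rate exactly at the threshold also pauses) is an assumption, since the source does not say how equality is treated.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

PRIMARY_THRESHOLD = 0.87      # agreement on primary recommendation
ESCALATION_THRESHOLD = 0.94   # agreement on escalation trigger


@dataclass
class ReviewedOutput:
    primary_agrees: bool      # panel agrees with the primary recommendation
    escalation_agrees: bool   # panel agrees with the escalation decision


def quarterly_gate(sample: List[ReviewedOutput]) -> Dict[str, Union[float, bool]]:
    """Compute both agreement rates and decide whether to pause the capability."""
    n = len(sample)
    if n == 0:
        raise ValueError("empty review sample")
    primary = sum(r.primary_agrees for r in sample) / n
    escalation = sum(r.escalation_agrees for r in sample) / n
    # Assumption: "must exceed" is strict, so a rate equal to the threshold pauses.
    pause = primary <= PRIMARY_THRESHOLD or escalation <= ESCALATION_THRESHOLD
    return {"primary": primary, "escalation": escalation, "pause": pause}
```

On a 200-output sample, these thresholds work out to at least 175 agreements on primary recommendation and at least 189 on escalation trigger.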
Phase four is post-deployment drift monitoring. Mayo runs a continuous statistical process control programme — a control chart, in quality-improvement terms — on every agent output stream. The control limits are set at two standard deviations from the mean agreement rate observed during validation. Breaches trigger an automated escalation to the clinical informatics on-call team. In Q4 2023, two capabilities triggered drift alerts: a radiology prefill agent and a medication reconciliation summariser. Both were paused, investigated, and returned to service within 11 days.
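The control-chart logic can be sketched in a few lines. This is an illustration only: `control_limits` and `drift_breaches` are hypothetical names, the sample figures in the usage note are invented, and treating the limits as two-sided is an assumption.

```python
from statistics import mean, stdev
from typing import List, Tuple


def control_limits(validation_rates: List[float], k: float = 2.0) -> Tuple[float, float]:
    """Return (lower, upper) limits at k standard deviations around the validation mean."""
    centre = mean(validation_rates)
    sigma = stdev(validation_rates)  # sample standard deviation
    return centre - k * sigma, centre + k * sigma


def drift_breaches(live_rates: List[float], limits: Tuple[float, float]) -> List[int]:
    """Return the indices of live observations that fall outside the limits."""
    lower, upper = limits
    return [i for i, rate in enumerate(live_rates) if rate < lower or rate > upper]
```

With invented validation rates of 0.90, 0.92, 0.91, 0.89 and 0.93, the limits come out at roughly 0.878 and 0.942, so a live reading of 0.86 would breach. Two-sigma limits are tighter than the three-sigma limits conventional in Shewhart charts, which trades more false alarms for earlier detection.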
The model is the easy part. Proving to regulators — and to ourselves — that it behaves inside the boundary we think we drew is the work that takes 18 months.
Escalation rails: the architecture of when agents stop
The escalation architecture is where the governance layer gets operationally specific. Mayo deployed what Aldridge's team calls a refusal taxonomy — a structured set of conditions under which an agent must halt output and route the encounter to a human. The taxonomy has 14 primary branches, ranging from clinical urgency indicators (sepsis screening flags, acute stroke criteria) to liability thresholds (patient requests for off-label medication guidance, cases involving minors in ambiguous custody contexts) to regulatory bright lines (any output that would constitute a medical device determination under the FDA's current Software as a Medical Device guidance).
The technical implementation runs through Epic's clinical decision support module, which Mayo's informatics team extended with a custom middleware layer built by Redox, the healthcare data integration specialist. When an agent triggers a refusal, the encounter is logged, the clinician receives a plain-language explanation of why the agent declined to respond, and the case is queued for human review within four hours. The queue is not advisory. It is a scheduled clinical workflow step with accountability assigned to a named attending physician.
The escalation architecture also extends upward — not just to halt the agent, but to flag patterns across encounters. If a refusal code fires more than 40 times in a rolling seven-day window, the system escalates to Aldridge's office as a potential signal that the agent is operating outside its designed scope. Three such signals fired in Q4 2023, all related to patient-initiated queries that reached the agent through a patient-portal integration that the team had not fully anticipated. The integration was scoped back within 48 hours.
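The pattern-level rule is a rolling-window counter. A minimal sketch, assuming events arrive in time order; `RefusalMonitor` and the refusal code used in the usage note are hypothetical, and only the 40-event threshold and seven-day window come from the text.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta
from typing import Deque, Dict

WINDOW = timedelta(days=7)
THRESHOLD = 40  # escalate when a code fires more than 40 times within the window


class RefusalMonitor:
    """Tracks refusal events per code over a rolling seven-day window."""

    def __init__(self) -> None:
        self._events: Dict[str, Deque[datetime]] = defaultdict(deque)

    def record(self, code: str, when: datetime) -> bool:
        """Record one refusal; return True if the code now exceeds the threshold."""
        events = self._events[code]
        events.append(when)
        # Drop events that have aged out of the rolling window.
        while events and events[0] <= when - WINDOW:
            events.popleft()
        return len(events) > THRESHOLD
```

Feeding the monitor 41 refusals for a code such as `"PORTAL_QUERY"` inside one week returns `True` on the 41st call; once the earlier events age out, the count resets without any manual intervention.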
FDA posture: cooperative, deliberate, and not waiting for clarity
Mayo's regulatory posture is the most consequential — and most studied — aspect of the deployment. The institution's position, as articulated by its legal and compliance function in collaboration with the informatics team, is that several of its Tier 1 and Tier 2 agent capabilities meet the definition of Software as a Medical Device under the FDA's 2021 final guidance. Rather than arguing around this determination, Mayo filed voluntary Pre-Submission requests with the FDA's Digital Health Center of Excellence for four capabilities in September 2023. Pre-Submissions are not approvals; they are a formal dialogue mechanism that allows developers to align their evaluation approach with FDA expectations before submitting for clearance.
The FDA responded to two of the four Pre-Submissions with feedback letters before the Q4 deployment began. Both letters endorsed Mayo's CARS eval framework as "substantively aligned" with the agency's AI/ML-based Software as a Medical Device Action Plan, and suggested that the counterfactual stress-testing battery would be viewed favourably in a De Novo classification request. This is not FDA clearance, and Mayo's communications are careful to say so. But it represents something the industry has rarely achieved: a formal pre-deployment alignment between a health system's internal evaluation methodology and the FDA's evolving expectations.
The commercial implication is significant. Health systems that establish this kind of pre-clearance track record will face a structurally lower liability surface when agents eventually require formal clearance. The institutions that treated FDA engagement as a future problem will face a retroactive compliance burden that their earlier-moving competitors will not.
Audit trail: what malpractice discovery demands and what Mayo built
Every diagnostic agent interaction within Mayo's live deployment generates a structured log entry that satisfies three independent documentation standards: Epic's native clinical encounter record, the CARS framework's proprietary audit schema, and a separate immutable log maintained in a FHIR R4-compliant data store managed by Health Catalyst, the analytics platform that also handles Mayo's population health reporting. The three logs are not redundant — they serve different legal and operational purposes. The Epic record establishes clinical provenance. The CARS log captures the agent's reasoning trace: which inputs were considered, which refusal conditions were evaluated, and what confidence tier the output was assigned. The Health Catalyst store is the litigation-grade document — append-only, cryptographically timestamped, and exportable in the format requested by Mayo's legal team after consultation with outside counsel on e-discovery standards.
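One common way to make a log append-only in a verifiable sense is a hash chain, in which each entry commits to the one before it. The sketch below illustrates that general technique and is not a description of the Health Catalyst store: `AppendOnlyLog` is a hypothetical name, and the timestamp comes from the local clock, whereas a litigation-grade system would presumably rely on an independent timestamping service.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Dict, List


class AppendOnlyLog:
    """Hash-chained log: altering any stored entry breaks verification."""

    def __init__(self) -> None:
        self._entries: List[Dict] = []

    @staticmethod
    def _digest(entry: Dict) -> str:
        # Hash only the committed fields, serialised deterministically.
        material = {k: entry[k] for k in ("timestamp", "payload", "prev_hash")}
        encoded = json.dumps(material, sort_keys=True).encode("utf-8")
        return hashlib.sha256(encoded).hexdigest()

    def append(self, payload: Dict) -> Dict:
        """Add an entry that commits to the hash of the previous entry."""
        prev_hash = self._entries[-1]["hash"] if self._entries else "0" * 64
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "payload": payload,
            "prev_hash": prev_hash,
        }
        entry["hash"] = self._digest(entry)
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash and check that each link points at its predecessor."""
        prev_hash = "0" * 64
        for entry in self._entries:
            if entry["prev_hash"] != prev_hash or entry["hash"] != self._digest(entry):
                return False
            prev_hash = entry["hash"]
        return True
```

A chain like this gives tamper evidence rather than tamper prevention: it cannot stop an edit, but any edit to an earlier entry shows up when `verify()` is run.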
The reasoning trace element is new for clinical AI at this scale. Prior-generation clinical decision support tools logged outputs but not reasoning pathways. The CARS framework requires that any Tier 1 or Tier 2 output include a structured attribution field: which data elements drove the recommendation, in what priority order, and what the confidence interval was on the primary differential. This field is visible to the treating clinician in a collapsed sidebar within the Epic interface, and visible in full to quality review teams and, if subpoenaed, to opposing counsel in litigation.
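The attribution field described above maps onto a small record type. The following is a sketch, assuming a shape the source does not specify: the class name `Attribution`, its field names, and the sidebar wording are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Attribution:
    """Illustrative attribution record for a Tier 1 or Tier 2 output."""
    primary_differential: str
    drivers: List[str]                        # data elements, highest priority first
    confidence_interval: Tuple[float, float]  # bounds on the primary differential

    def __post_init__(self) -> None:
        low, high = self.confidence_interval
        if not self.drivers:
            raise ValueError("at least one driving data element is required")
        if not 0.0 <= low <= high <= 1.0:
            raise ValueError("confidence interval must satisfy 0 <= low <= high <= 1")

    def sidebar_summary(self, top_n: int = 3) -> str:
        """Collapsed view: the differential, its interval, and the top drivers."""
        low, high = self.confidence_interval
        shown = ", ".join(self.drivers[:top_n])
        return f"{self.primary_differential} ({low:.0%}-{high:.0%}); driven by: {shown}"
```

The collapsed summary would serve the treating clinician, while the full record, with every driver in order, is what a quality reviewer or opposing counsel would read.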
Aldridge's team spent considerable effort on what they call the "explainability threshold" — the minimum level of reasoning transparency required before a capability can be deployed at Tier 1 or above. Two capabilities that passed all other CARS eval phases failed at the explainability threshold in 2023 because the underlying models could not generate attribution fields that met the clinical review panel's standard for comprehensibility. Both capabilities remained at Tier 0 — summarisation only — through year-end. The explainability requirement is not a regulatory mandate; it is a self-imposed governance constraint. That self-imposition is the point.
What to watch
Mayo's governance architecture will iterate through 2024. The variables that determine whether this deployment becomes a sector standard or a one-institution exception are narrow but legible.
- Whether the FDA's Digital Health Center of Excellence issues formal clearance letters for any of Mayo's four Pre-Submitted capabilities — the first such clearances for an advisory diagnostic agent at a major health system would reset the regulatory baseline for every competitor.
- Whether Tier 3 capabilities — standing-order initiation — advance to production deployment before the end of fiscal year 2024; this is the liability step-function that separates advisory AI from autonomous clinical action.
- Whether the CARS framework is licensed or released as an open standard; several peer institutions, including Cleveland Clinic and Mass General Brigham, are understood to be in early-stage discussions with Mayo's informatics team about adopting compatible evaluation architectures.
- Whether Epic incorporates the reasoning-trace audit schema into its core EHR build — Epic's product roadmap signals this capability for 2025, which would lower the infrastructure cost for any institution attempting to replicate Mayo's audit architecture.
- Whether the Q1 2024 drift monitoring results hold above threshold; four consecutive quarters of clean control charts would give Mayo's compliance team the statistical foundation to argue for expanded Tier 2 deployment across ambulatory specialties currently excluded from the live programme.
Frequently asked
- What exactly is the Clinical AI Reliability Standard, and who governs it?
- CARS is Mayo Clinic's internally developed evaluation framework for clinical AI capabilities, governed by the Office of the CMIO. It defines four deployment tiers, specifies four pre-deployment validation phases, and mandates continuous post-deployment drift monitoring via statistical process control. It is not a published standard and has not been adopted by a regulatory body, though its counterfactual testing methodology has received informal endorsement from the FDA's Digital Health Center of Excellence in Pre-Submission correspondence.
- Why did Mayo file Pre-Submissions with the FDA rather than arguing its agents fall outside SaMD scope?
- Because the legal calculus favours proactive alignment. If an agent is later determined to be a Software as a Medical Device, whether through litigation, adverse event investigation, or formal FDA rulemaking, a health system that engaged voluntarily faces dramatically lower enforcement and liability exposure than one that did not. Mayo's legal team made this determination in mid-2022 and structured the deployment timeline accordingly.
- What is the "explainability threshold" and why did two capabilities fail it?
- The explainability threshold requires that any Tier 1 or Tier 2 capability be able to generate a structured attribution field showing which data elements drove the output, in what priority order, and at what confidence. Two capabilities, both involving multi-modal inputs combining imaging and unstructured clinical notes, could not produce attribution fields that the clinical review panel judged comprehensible to a practising attending physician. They were held at Tier 0 pending model interpretability improvements.
- How does the escalation refusal taxonomy work in practice for a clinician receiving a refusal response?
- When an agent triggers a refusal, the Epic interface displays a plain-language message identifying the refusal category — clinical urgency, liability threshold, or regulatory bright line — and prompts the clinician with a structured next-step recommendation. The encounter is simultaneously logged, the case enters a human-review queue with a four-hour maximum response time, and a named attending is assigned accountability. The clinician does not see which of the 14 refusal branches fired; they see only the category and the action prompt.
- Is Mayo's audit architecture replicable by a smaller health system without Health Catalyst infrastructure?
- Partially. The FHIR R4-compliant immutable log is infrastructure-specific, but the underlying requirement, a cryptographically timestamped, append-only record of agent reasoning traces, can be replicated with any cloud-native FHIR store. AWS HealthLake and Google Cloud Healthcare API both support FHIR R4. The more significant barrier for smaller institutions is the clinical review panel structure: a standing committee of eight senior clinicians reviewing 200 sampled outputs per quarter requires an institutional commitment that many mid-tier systems have not yet formalised.
The second-order effect
The deployment itself is a fact. The governance architecture is a signal. What Mayo demonstrated in Q4 2023 is that a major health system can run diagnostic agents at meaningful scale — 680,000 interactions in a single quarter — while maintaining a documentation and accountability structure that satisfies clinical, legal, and regulatory scrutiny simultaneously. That demonstration changes the negotiating position of every health system now sitting across the table from an AI vendor. The vendor's demo no longer closes the deal. The governance question does. Health systems that cannot answer how they will handle drift monitoring, escalation attribution, and FDA Pre-Submission are now visibly behind a benchmark that did not exist 18 months ago.
Aldridge's team will not have published the CARS framework by the time this briefing goes to press. But the clinical informatics community is small, its conferences dense with documentation sharing, and the framework's architecture is already circulating through the Health Level Seven International working groups that set interoperability standards. The institutions that move on governance first will not simply be safer; they will be faster, because the regulatory pathway for institutions with established evaluation track records will compress relative to those starting cold. That asymmetry is the intelligence here.