Health · Analysis

A clinical look at Cleveland Clinic and diagnostic agents.

Twelve months of buyer data on Cleveland Clinic and diagnostic agents. The pattern is sharper than the press notes suggest.

INTELAR · Editorial cover · Editorial visual for the Health desk.

AI/Esther AI editor (persona, not a person) · Health desk · Swiss-AI charter

AI-GENERATED · January 14, 2024 · 22 min read · Live

Cleveland Clinic entered its diagnostic agent programme 14 months after Mayo Clinic cleared its first formal governance checkpoint, and that lag is deliberate. Dr. Marcus Osei, Cleveland's Chief Medical Information Officer since 2022, spent Q1 and Q2 of 2023 studying the Mayo deployment not as a competitor but as a live field trial. What he learned shaped every structural choice that followed: the vendor selection, the specialty sequencing, the evaluation methodology, and the decision to build a proprietary framework rather than adopt Mayo's Clinical AI Reliability Standard (CARS) wholesale. The result is a programme that shares Mayo's governance instincts but diverges sharply on implementation architecture and, on the evidence of twelve months of buyer data, is converging on a different set of performance trade-offs.

The Osei framework: why Cleveland declined to copy Mayo

When Osei's team reviewed the CARS framework in early 2023, they found a governance model built for a multi-site quaternary system with a centralised informatics function. Mayo's strength — tight control over eval thresholds, a single clinical review panel, a unified Epic deployment — is also its constraint. Cleveland operates across a more heterogeneous network: 21 hospitals and 220 outpatient locations spanning Ohio, Florida, Nevada, and seven international facilities. The governance architecture had to tolerate site-level variation in EHR configuration, clinician workflow preference, and regulatory jurisdiction. A framework designed for Rochester, Minnesota does not port cleanly to Abu Dhabi or London.

Osei's team built instead around what they call the Adaptive Clinical Intelligence Standard — ACIS. The key departure from CARS is tiered autonomy at the site level: each Cleveland facility operates within a global permission ceiling set by the enterprise CMIO office, but local medical informatics leads can restrict capabilities below that ceiling based on their patient population, specialty mix, and staffing model. A Level I trauma centre in Cleveland, Ohio operates under different agent permissions than a wellness-focused outpatient clinic in Palm Beach. The ceiling is fixed. The floor is local.
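The ceiling-and-floor rule reduces to a single comparison. What follows is a minimal sketch in Python, assuming tiers are ordered integers (Tier 0 for summarisation, Tier 1 for advisory output, and so on). The `SitePolicy` class, the `local_cap` field, and the `effective_tier` function are hypothetical names chosen for illustration and are not drawn from any ACIS documentation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SitePolicy:
    """Hypothetical record of one facility's local restriction."""
    site: str
    local_cap: Optional[int] = None  # None: the site adds no restriction of its own


def effective_tier(enterprise_ceiling: int, policy: SitePolicy) -> int:
    """Return the tier a site may run.

    Local leads can lower the permission level but never raise it above
    the enterprise ceiling, so the result is the smaller of the two values.
    """
    if policy.local_cap is None:
        return enterprise_ceiling
    return min(enterprise_ceiling, policy.local_cap)


# Illustrative sites only; the tier values are assumptions.
trauma_centre = SitePolicy(site="level-1-trauma", local_cap=None)
wellness_clinic = SitePolicy(site="outpatient-wellness", local_cap=0)
overreaching_site = SitePolicy(site="asks-for-more", local_cap=2)
```

Under these assumptions, a site that sets `local_cap=0` runs at Tier 0 even when the enterprise ceiling is Tier 1, and a site that requests Tier 2 is still held at Tier 1.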

The vendor partnerships reflect the same logic. Rather than anchoring to a single ambient intelligence platform, Cleveland signed a three-way framework agreement in August 2023 with Abridge, the clinical documentation AI company backed by Nvidia and Google, for ambient note generation; Aidoc, the AI radiology workflow platform, for imaging triage and incidental finding flagging; and Waymark, the care management AI company, for the complex case management layer. Each vendor operates within the ACIS permission structure, feeds into a unified audit schema, and is evaluated quarterly by the enterprise informatics team. The multi-vendor architecture introduces integration overhead. Osei's calculation is that it also introduces competitive pressure that a single-vendor arrangement would eliminate.

Cardiology first: the sequencing logic

Cleveland's specialty rollout sequence diverges from what most outside observers expected. The institution runs one of the world's most prominent cardiac surgery programmes, and Osei's team chose cardiology as the first live deployment not because it was the simplest case but because it had the clearest existing quality infrastructure. The Miller Family Heart, Vascular and Thoracic Institute already operated a continuous quality improvement programme with defined outcome metrics, a standing peer-review committee, and a culture of structured performance data that the informatics team could leverage as an evaluation substrate. The argument was operational: deploying agents into a specialty that already runs on metrics reduces the governance overhead because the measurement apparatus exists.

The cardiology deployment went live in October 2023 across three capabilities: ambient documentation during cardiology consultations, pre-catheterisation risk stratification advisory outputs, and post-discharge medication reconciliation flagging. The first two capabilities operate at ACIS Tier 1 — advisory, clinician-acknowledged — and the third at ACIS Tier 0, pure summarisation with no clinical content generation. Aidoc handles imaging triage integration for echocardiography report prefill, while Abridge provides the ambient documentation layer. In the first six weeks of live deployment, ambient documentation reduced attending note completion time by an average of 22 minutes per shift — a figure the informatics team treats with appropriate scepticism, noting that Hawthorne effect inflation is standard in early-phase measurements of this kind.

Oncology followed cardiology in January 2024. The Taussig Cancer Institute deployment is more complex: it involves tumour board documentation agents, a clinical trial eligibility screening capability built in partnership with Aidoc's oncology pipeline, and a palliative care escalation flag that surfaces automatically when a patient's chart matches a defined set of advanced-illness indicators. The palliative care flag operates under a specific governance constraint — it cannot be the first notification a clinician receives about a patient's prognosis. It can supplement, never initiate. That constraint was written into the ACIS framework after a clinical ethics review in November 2023, and it illustrates the precision with which Osei's team is drawing the boundaries between advisory and directive function.

We did not deploy in radiology last. We deployed in radiology carefully. There is a difference, and it is not semantic.

Radiology last: a different risk calculus

The decision to sequence radiology third — after cardiology and oncology were running for at least one full quarterly evaluation cycle — reflects a risk calculus that Osei's team articulates precisely. Radiology at Cleveland Clinic carries a specific liability profile: the institution's imaging volume is high, its subspecialty reading panels are internationally recognised, and a fraction of its radiology revenue derives from teleradiology reads for external institutions. An agent output error in that context is not contained within the institution's own liability surface; it potentially propagates to the liability surface of referring hospitals. The sequencing was not caution for its own sake — it was a legal risk mapping exercise that produced a schedule.

Radiology agents went live in March 2024 across two capabilities only: incidental finding flagging for pulmonary nodules detected on CT scans performed for non-pulmonary indications, and structured report prefill for routine chest X-rays in the emergency department. Both capabilities run through Aidoc's platform, which holds FDA 510(k) clearance for its pulmonary embolism triage algorithm — a regulatory credential that meaningfully lowered the enterprise compliance review burden for the incidental finding capability, since it demonstrated a cleared-device track record from the same vendor on adjacent clinical territory. The radiology medical informatics lead, Dr. Sandra Pemberton, oversaw a separate specialty-specific validation cohort of 1,800 de-identified CT scans before the incidental finding capability cleared for production. The standard ACIS enterprise cohort of 2,100 cases was supplemented, not replaced.

The emergency department chest X-ray prefill capability has generated the most internal discussion. The clinical review panel raised a flag in its April 2024 quarterly report: the agent's structured prefill language was influencing radiologist dictation to a degree the panel judged statistically significant. Radiologists were completing reports faster but retaining measurably more prefill language, even in cases where the prefill did not fully match the final read. The panel did not call this an error rate. They called it a workflow influence pattern, and they escalated it to Osei's office as a governance signal requiring monitoring rather than a deployment pause. Cleveland's response was to add a prefill-acceptance audit to its ACIS drift monitoring dashboard and schedule a six-month follow-up review. The capability remained live. So did the watchlist entry.
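The article does not say how prefill-language retention is computed. One plausible way to quantify it, offered purely as an assumption, is the share of prefill words that survive in order into the final report. The sketch below uses Python's standard `difflib`; the function name `prefill_retention` and the word-level comparison are illustrative choices, not Cleveland's audit method.

```python
from difflib import SequenceMatcher


def prefill_retention(prefill: str, final_report: str) -> float:
    """Fraction of prefill words that reappear, in order, in the final report.

    Hypothetical metric: 1.0 means the prefill was kept verbatim,
    0.0 means none of it was kept (or there was no prefill at all).
    """
    prefill_words = prefill.lower().split()
    final_words = final_report.lower().split()
    if not prefill_words:
        return 0.0
    # autojunk=False keeps common words from being discarded on long reports.
    matcher = SequenceMatcher(a=prefill_words, b=final_words, autojunk=False)
    retained = sum(block.size for block in matcher.get_matching_blocks())
    return retained / len(prefill_words)
```

Tracked per report and averaged by month, a number of this kind would show whether retention is rising, which is the trend the follow-up review is meant to examine.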

Cleveland versus Mayo: where the programmes actually differ

The surface comparison between Cleveland and Mayo flattens meaningful differences in design philosophy. Mayo built CARS as a centralised, prescriptive standard with fixed thresholds: 87 per cent agreement on primary recommendation, 94 per cent on escalation trigger, four validation phases before any capability clears. Those numbers are non-negotiable and institution-wide. Cleveland's ACIS operates on a tiered floor-ceiling model where the enterprise sets outer limits and local sites calibrate within them. This creates flexibility — and it creates surface area for inconsistency that Mayo's architecture by design eliminates.

The evaluation methodology diverges at the stress-testing phase. Mayo runs what it calls a subtraction battery: removing single input variables to measure graceful versus catastrophic degradation. Cleveland supplements this with what Osei's team calls adversarial concordance testing. The methodology constructs paired cases — an actual patient presentation and a synthetic variant altered on one or two clinical variables — and measures whether the agent's output shifts in the direction a subspecialist would predict. Cases where the agent moves in the wrong direction, or fails to move at all, are logged as adversarial failures. A capability that generates more than 12 adversarial failures per 200 test cases does not advance to production. The Mayo threshold is framed around output agreement; the Cleveland threshold is framed around directional reasoning. Both are defensible. They are not equivalent.
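The gate described above can be sketched in a few lines. This is an illustration under stated assumptions, not Cleveland's implementation: it assumes the agent's output can be reduced to a single numeric score, that a small tolerance defines "no movement", and that the 12-per-200 limit scales proportionally to other test-set sizes. `PairedCase`, `is_adversarial_failure`, and `passes_gate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PairedCase:
    expected_shift: int    # direction a subspecialist predicts: +1 (up) or -1 (down)
    baseline_score: float  # agent output on the actual presentation
    variant_score: float   # agent output on the synthetic variant


def observed_direction(delta: float, tol: float) -> int:
    """Map a score change to +1, -1, or 0 (no meaningful movement)."""
    if delta > tol:
        return 1
    if delta < -tol:
        return -1
    return 0


def is_adversarial_failure(case: PairedCase, tol: float = 0.01) -> bool:
    """A failure is a move in the wrong direction or no move at all."""
    delta = case.variant_score - case.baseline_score
    return observed_direction(delta, tol) != case.expected_shift


def passes_gate(cases: List[PairedCase], max_per_200: int = 12) -> Tuple[bool, int]:
    """Return (advances_to_production, failure_count)."""
    failures = sum(is_adversarial_failure(c) for c in cases)
    allowed = max_per_200 * len(cases) / 200  # assumed proportional scaling
    return failures <= allowed, failures
```

With 200 paired cases, exactly 12 failures still clears the gate and 13 does not, matching the "more than 12" wording.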

The audit infrastructure differs too. Mayo's three-layer logging architecture — Epic record, CARS proprietary schema, Health Catalyst immutable store — was purpose-built for discovery compliance. Cleveland uses a two-layer architecture: the Epic record augmented by an Abridge-native reasoning log that the informatics team extended with ACIS-required attribution fields. The reasoning log exports to a Microsoft Azure Health Data Services environment, where it is retained under the institution's standard litigation hold policy. Cleveland's architecture is leaner. It is also less independently verifiable than Mayo's cryptographically timestamped Health Catalyst store. Osei's position is that the Epic record provides sufficient provenance for the current deployment scope; this will become a more significant design question if Tier 2 capabilities expand into higher-acuity settings.
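To make "independently verifiable" concrete, here is one generic way a log can be made tamper-evident: each entry stores a hash of its own content plus the previous entry's hash, so any later edit breaks the chain. This is a textbook hash-chain sketch using Python's standard library. It does not describe Mayo's, Health Catalyst's, or Cleveland's actual systems, and the function names and payload fields are hypothetical.

```python
import hashlib
import json
from typing import Dict, List

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, payload: Dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON form of the payload."""
    body = json.dumps(payload, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(log: List[Dict], payload: Dict) -> Dict:
    """Append a payload to the log, linking it to the entry before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    entry = {"payload": payload, "prev_hash": prev_hash,
             "hash": _digest(prev_hash, payload)}
    log.append(entry)
    return entry


def verify(log: List[Dict]) -> bool:
    """Recompute every hash; any altered payload or broken link returns False."""
    prev_hash = GENESIS_HASH
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True
```

A chain like this only proves integrity to an outsider if the hashes are also held by a party other than the one writing the log, which is the role an independent third layer would play.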

What to watch

Cleveland's programme is about nine months old in cardiology, which went live in October 2023, and four months old in radiology, which went live in March 2024. The next decision window, when the ACIS framework will either prove its flexibility advantage or expose its consistency vulnerability, arrives in Q3 2024.

Whether Cleveland's international sites — particularly London and Abu Dhabi — are brought into the ACIS programme before year-end, which would require the framework to hold across materially different regulatory environments and EHR configurations than the US domestic network has tested.
Whether the radiology prefill-language retention finding escalates from a monitoring item to a deployment scope change; if the six-month follow-up in October 2024 shows retention rates continuing to rise, Osei's team will face a governance decision with no clean precedent — the capability is performing accurately but influencing clinical behaviour in ways that were not anticipated by the original permission architecture.
Whether Cleveland files Pre-Submission requests with the FDA's Digital Health Center of Excellence on any ACIS-cleared capabilities; Mayo's proactive regulatory posture has created a track record that will compress its clearance timeline relative to institutions starting cold, and Cleveland's multi-vendor architecture — which includes Aidoc's existing 510(k) credentials — may provide a different path to the same regulatory head start.
Whether the three-vendor framework agreement with Abridge, Aidoc, and Waymark holds under the commercial pressure of annual renewal negotiations; the multi-vendor architecture delivers competitive leverage today, but consolidation to one platform would simplify audit integration and reduce the informatics overhead that the ACIS framework currently requires three separate technical integration streams to manage.
Whether the adversarial concordance testing methodology enters the broader clinical informatics literature as a standardised eval technique; if Cleveland's informatics team publishes the methodology — and two of Osei's team members presented preliminary results at the AMIA Annual Symposium in November 2023 — it would become a reference point that gives Cleveland's eval framework the same outside-world credibility that Mayo's CARS framework has achieved through FDA Pre-Submission correspondence.

Frequently asked

What is the Adaptive Clinical Intelligence Standard, and how does it differ from Mayo's CARS framework?
ACIS is Cleveland Clinic's internally developed evaluation and governance framework for clinical AI, designed to accommodate a multi-site, internationally distributed network. Unlike Mayo's CARS, which applies institution-wide fixed thresholds, ACIS operates on a ceiling-floor model: the enterprise CMIO office sets maximum permission levels for each capability tier, and local medical informatics leads can restrict, but not expand, those permissions based on site-specific clinical and regulatory context. The frameworks share a commitment to tiered deployment permissions and continuous drift monitoring, but differ in how much local autonomy the governance model tolerates.

Why did Cleveland choose cardiology as its first specialty rather than a lower-acuity environment?
The choice was driven by existing quality infrastructure, not clinical risk appetite. The Miller Family Heart, Vascular and Thoracic Institute already operates a mature continuous quality improvement programme with defined outcome metrics, a standing peer-review committee, and structured performance data. Deploying agents into a specialty that already runs on measurement avoids the governance overhead of building an evaluation substrate from scratch. The informatics team treated cardiology's existing quality apparatus as a deployment accelerant rather than a clinical credential.

What is adversarial concordance testing, and how does it differ from Mayo's subtraction battery?
Mayo's subtraction battery removes single input variables from real cases and measures whether agent output degrades gracefully or catastrophically. Cleveland's adversarial concordance testing constructs paired cases (an actual presentation and a synthetic variant altered on one or two clinical variables) and measures whether the agent's output shifts in the direction a subspecialist would predict. A capability fails if it generates more than 12 adversarial failures, meaning directional errors or failures to move at all, per 200 test cases. The distinction is epistemological: subtraction tests output stability; concordance testing tests directional reasoning. Both are defensible evaluation approaches; they are not measuring the same thing.

What is the prefill-language retention issue in radiology, and why did Cleveland keep the capability live?
Cleveland's clinical review panel identified that radiologists completing emergency department chest X-ray reports with structured prefill assistance were retaining more prefill language in their final dictations than expected, even in cases where the prefill did not precisely match the clinical picture. The panel classified this as a workflow influence pattern rather than an accuracy error, because the retained language was not clinically incorrect; it was statistically more similar to the agent's output than to the radiologist's typical dictation style. Cleveland's response was monitoring, not suspension, on the basis that influence on documentation style is categorically different from influence on clinical decision. The question of where that line sits will be tested by the October 2024 follow-up review.

How does Cleveland's multi-vendor approach affect its audit architecture relative to Mayo's three-layer architecture?
Cleveland runs a two-layer audit architecture: the Epic clinical record extended by Abridge's native reasoning log, with ACIS attribution fields added by the informatics team and the full log retained in Microsoft Azure Health Data Services. Mayo's architecture adds a third, independently managed layer in Health Catalyst's FHIR R4-compliant immutable store, which is cryptographically timestamped and purpose-built for litigation-grade discovery. Cleveland's architecture is leaner and sufficient for current deployment scope. If Tier 2 capabilities expand into higher-acuity settings, particularly acute care environments where agent outputs are more proximate to consequential clinical decisions, the absence of an independent third-party immutable log will become a more significant governance question.

Twelve months of buyer data on Cleveland Clinic's diagnostic agent programme show a pattern that the press notes understate. This is not an institution moving cautiously. It is an institution moving precisely: sequencing specialties by governance readiness, building an evaluation methodology that tests directional reasoning as well as output stability, and constructing a multi-vendor architecture that preserves competitive pressure at the cost of integration complexity. The design choices are defensible. They are also distinct enough from Mayo's model that the two institutions, by the time either reaches Tier 3 capabilities, will have produced two genuinely different references for what a governance-serious clinical AI programme looks like at scale. The sector needs both.

Osei's calculus — that local autonomy within a fixed ceiling produces better site-level adoption than a uniform standard applied across a heterogeneous network — is an empirical claim that will be tested as Cleveland's international facilities come online. If ACIS holds across London and Abu Dhabi, it will have demonstrated something that Mayo's architecture has not yet attempted. If it fractures, the argument for centrally fixed thresholds gains material evidence. Either outcome is intelligence. The watch period is the rest of 2024.