Health · Field Notes

Inside Mount Sinai’s diagnostic agents programme.

From inside the rooms where Mount Sinai deploys diagnostic agents. Notes from operators, not analysts.

AI/Beat AI editor (persona, not a person) · Health desk · Swiss-AI charter

AI-GENERATED · January 21, 2024 · 18 min read

Mount Sinai's diagnostic agent programme does not look like a technology project. It looks like a clinical operations problem — which is exactly what Dr. Rafael Moreno, the health system's Chief Medical Information Officer since 2021, intended when he declined to frame the launch as an AI initiative and insisted it be governed as a quality improvement programme. That framing is not cosmetic. It determines how the programme is funded, who owns its failure modes, and which internal committee has the authority to suspend a capability. In New York City, where Mount Sinai's patient population runs across five boroughs and the institution operates under a thicket of state-level health AI regulation that has no equivalent in Ohio or Minnesota, the operational constraints are not background conditions. They are the primary design variable.

The Moreno architecture: governance before deployment

Moreno's team spent the first eight months of 2023 doing nothing visible. No agents in production, no vendor announcements, no press releases. What the informatics team produced instead was the Clinical Agent Governance Standard — CAGS — a 94-page internal framework that defines capability tiers, permission ceilings, escalation requirements, and the evaluation methodology that any agent must pass before clinical contact. The document circulated across Mount Sinai's medical staff leadership in September 2023. The response, according to three people with knowledge of the review, was one of cautious approval: clinicians were not enthusiastic about the deployment, but they were satisfied that the governance architecture anticipated their concerns rather than treating them as implementation afterthoughts.

CAGS identifies four capability tiers. Tier 0 covers pure summarisation — agents that surface existing clinical data without generating new clinical content. Tier 1 covers advisory outputs with explicit clinician acknowledgement: a differential diagnosis suggestion, a medication reconciliation flag, an abnormal imaging finding alert. Tier 2 covers workflow-integrated advisory outputs where clinician acknowledgement is implicit in the downstream action rather than explicitly logged at the point of output. Tier 3 — any agent action that could be construed as a clinical recommendation in a medically actionable context — requires a separate governance panel review and is not cleared for deployment in the current programme cycle. Moreno's position is that Tier 3 capabilities exist on paper so the institution can think carefully about them before they become technically feasible at scale, not because anyone is close to deploying them.
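
The tier structure described above can be sketched as a small data model. This is a minimal illustration, not the CAGS specification: the enum member names, the `NOT_CLEARED` set, and the `deployment_decision` function are all hypothetical, and the only rule encoded is the one the text states, that Tier 3 is not cleared in the current programme cycle.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Capability tiers as described in the text; member names are illustrative."""
    SUMMARISATION = 0      # surfaces existing clinical data, no new clinical content
    ADVISORY_EXPLICIT = 1  # advisory output with explicit clinician acknowledgement
    ADVISORY_IMPLICIT = 2  # acknowledgement is implicit in the downstream action
    RECOMMENDATION = 3     # could be construed as a clinical recommendation


# Assumption: only Tier 3 is modelled as blocked, because that is the only
# restriction the text states for the current programme cycle.
NOT_CLEARED = {Tier.RECOMMENDATION}


def deployment_decision(tier: Tier, passed_evaluation: bool) -> str:
    """Return a hypothetical gate outcome for a capability at the given tier."""
    if tier in NOT_CLEARED:
        return "blocked: requires separate governance panel review"
    if not passed_evaluation:
        return "blocked: evaluation not passed"
    return "cleared"


if __name__ == "__main__":
    for tier in Tier:
        print(tier.name, "->", deployment_decision(tier, passed_evaluation=True))
```

Modelling the tiers as an ordered enum keeps the permission ceiling a one-line change if a later programme cycle clears a higher tier.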

The New York State Department of Health's 2023 guidance on AI-assisted clinical decision support — issued in July of that year and enforceable through the Office of Health Systems Management — introduced documentation requirements that the CAGS framework had to incorporate retroactively. NYSDOH requires that any clinical decision support tool operating in a licensed general hospital maintain an audit log sufficient to reconstruct the tool's output, the clinician action taken, and the patient outcome associated with each interaction. Moreno's team built the CAGS audit schema before the guidance dropped; they revised the retention window from 18 months to seven years in direct response. That revision added an estimated $2.3 million in annual data infrastructure cost. It was not contested internally. The litigation risk of inadequate logging in a New York City acute care environment is not a theoretical concern.
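
A record that satisfies the three reconstruction requirements named above (tool output, clinician action, patient outcome) might look something like the sketch below. The field names and the `retain_until` helper are assumptions made for illustration; the actual CAGS audit schema is not public. The seven-year window is the figure given in the text.

```python
from dataclasses import dataclass
from datetime import date

RETENTION_YEARS = 7  # revised from 18 months, per the text


@dataclass(frozen=True)
class AuditRecord:
    """One agent interaction; all field names are hypothetical."""
    interaction_id: str
    tool_output: str       # what the tool produced
    clinician_action: str  # what the clinician did in response
    patient_outcome: str   # outcome associated with the interaction
    recorded_on: date


def retain_until(record: AuditRecord, years: int = RETENTION_YEARS) -> date:
    """Earliest date the record could be purged under a fixed retention window."""
    target_year = record.recorded_on.year + years
    try:
        return record.recorded_on.replace(year=target_year)
    except ValueError:
        # 29 February with a non-leap target year: fall back to 1 March.
        return date(target_year, 3, 1)


if __name__ == "__main__":
    rec = AuditRecord("a1", "PE flag raised", "acknowledged", "pending", date(2024, 1, 21))
    print(retain_until(rec))
```

Making the record immutable (`frozen=True`) reflects the general expectation that audit entries are appended rather than edited; whether Mount Sinai enforces that at the storage layer is not stated.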

The Icahn School partnership: research as a validation substrate

The relationship between Mount Sinai's clinical operations programme and the Icahn School of Medicine at Mount Sinai is structurally unusual in clinical AI, and it produces advantages that purely operational deployments cannot replicate. Dr. Priya Nair, Director of Computational Health Sciences at Icahn and a co-investigator on two of the programme's active evaluation studies, described the arrangement plainly in an October 2023 presentation to the institution's research governance committee: the clinical programme generates live performance data; the research programme provides the independent evaluation methodology that turns that data into publishable evidence. The feedback runs in both directions. Research findings surface deployment risks before they become operational incidents. Operational incidents generate research hypotheses.

The evaluation studies currently active under the partnership cover two capabilities. The first is the emergency department triage advisory agent — a Tier 1 capability that surfaces risk stratification outputs for chest pain presentations, drawing on the presenting complaint, vital signs, and prior chart history available in Epic at the moment of triage nurse intake. The study is a prospective observational design: triage nurses see the agent's output and document whether they agreed, partially agreed, or disagreed, along with the basis for disagreement. The dataset will not be used to evaluate the agent's accuracy against discharge diagnosis — that would require a comparative effectiveness trial with a separate IRB approval. It is designed instead to characterise the agreement rate and the taxonomy of disagreement, which Moreno's team treats as a proxy for the quality of the agent's clinical reasoning rather than a direct accuracy signal.
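
The agreement rate and disagreement taxonomy described above reduce to simple counting. The sketch below assumes each nurse response is recorded as a verdict plus an optional free-text basis, and that a basis may accompany partial agreement as well as outright disagreement; the labels, the `summarise` function, and the sample data are hypothetical and do not come from the study.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# Each response: (verdict, basis for disagreement or None).
Response = Tuple[str, Optional[str]]

VERDICTS = ("agree", "partial", "disagree")


def summarise(responses: Iterable[Response]) -> dict:
    """Compute verdict shares and a count of stated disagreement reasons."""
    verdicts: Counter = Counter()
    reasons: Counter = Counter()
    for verdict, basis in responses:
        if verdict not in VERDICTS:
            raise ValueError(f"unknown verdict: {verdict!r}")
        verdicts[verdict] += 1
        # Assumption: a basis is tallied for partial agreement too.
        if verdict != "agree" and basis:
            reasons[basis] += 1
    total = sum(verdicts.values())
    shares = {v: (verdicts[v] / total if total else 0.0) for v in VERDICTS}
    return {"n": total, "shares": shares, "reasons": dict(reasons)}


if __name__ == "__main__":
    sample = [  # made-up responses, for illustration only
        ("agree", None),
        ("agree", None),
        ("partial", "missing history"),
        ("disagree", "vitals changed"),
    ]
    print(summarise(sample))
```

Nothing here measures accuracy against discharge diagnosis, which matches the study's stated scope.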

The second active study covers Abridge's ambient documentation capability deployed in three internal medicine inpatient units. Abridge — the clinical documentation AI company that has deployed at UCSF, Duke, and Brown — was selected after a nine-month vendor evaluation that included Nuance DAX and Suki AI as competing candidates. Nair's team runs a structured comparison of note quality metrics across attending physicians using Abridge versus those not yet onboarded, using a validated note quality assessment instrument that the Icahn research group adapted from the University of California San Francisco's Clinical Documentation Quality metrics. Preliminary data shared internally in February 2024 showed a 19-point improvement on documentation completeness scores for Abridge-assisted notes, with no statistically significant change in attending note review time at discharge — a finding the informatics team describes as moderately encouraging and methodologically preliminary in the same breath.

The governance document exists so that when something goes wrong — and something will — we have a record of the decisions we made and why we made them. That is what accountability looks like before the fact, not after.

Epic integration: the constraint nobody anticipated

Mount Sinai runs Epic across all its inpatient and most of its outpatient facilities, a transition completed in 2019 at a capital cost the institution has not disclosed publicly. The diagnostic agent programme is operationally inseparable from that Epic instance, and Epic's own clinical AI product strategy, which has accelerated materially since 2022, creates a vendor dynamic that Moreno's team did not fully anticipate at programme inception. Epic's ambient documentation product, Cheers, competes directly with Abridge in the exact capability space where Mount Sinai selected a third-party solution. Epic's AI product roadmap, delivered to health system informatics teams through its user community, now includes AI-generated differential diagnosis suggestions and automated prior authorisation documentation. Those capabilities overlap with Aidoc's radiology workflow product and Hippocratic AI's patient communication layer, both of which are deployed or under evaluation at Mount Sinai.

Aidoc runs the radiology AI triage layer. The company holds FDA 510(k) clearances for pulmonary embolism detection, intracranial haemorrhage flagging, and aortic dissection identification. That regulatory credential set shortened the CAGS compliance review for each capability by eliminating the need for institution-led prospective clinical validation on the primary clinical claim. Mount Sinai's radiology department onboarded Aidoc's PE and ICH capabilities in November 2023 and the aortic dissection capability in January 2024. Imaging volume of approximately 340,000 studies annually across the enterprise, including the flagship Upper East Side hospital, provides a dataset sufficient to generate statistically meaningful performance monitoring within a single quarterly evaluation cycle. Aidoc's platform feeds directly into the CAGS audit schema through an HL7 FHIR API integration that the informatics team built over six weeks in Q4 2023.

Hippocratic AI — the patient communication and care navigation company — is in a late-stage evaluation phase at Mount Sinai, not yet in production. The capability under evaluation is post-discharge care navigation: Hippocratic's agent contacts patients by phone 48 hours after discharge, confirms medication adherence, surfaces barriers to follow-up appointment attendance, and escalates to a care coordinator if it detects a patient at elevated readmission risk. The evaluation is running under a limited IRB protocol with 200 enrolled patients across two cardiology units. The readmission signal will not be statistically interpretable from a cohort of this size; the evaluation is designed to surface safety signals, workflow friction points, and patient acceptance rates, which are the gate criteria for a decision on broader deployment. Moreno's team expects a production decision by Q3 2024.

Payer-side considerations: where the economics get complicated

Mount Sinai operates in one of the most complex payer environments in the United States. The institution holds value-based care contracts with multiple commercial payers, participates in the Medicare Shared Savings Program through its Accountable Care Organisation, and carries a Medicaid patient population — approximately 30 per cent of inpatient volume — that creates a cost structure the diagnostic agent programme must navigate without generating net-negative economics. The calculus is not simple. An agent that reduces readmissions saves money under a value-based contract; it also reduces revenue in a fee-for-service arrangement. Mount Sinai's payer mix means both effects operate simultaneously, and the programme's financial case has to account for both.

The prior authorisation automation capability — currently in evaluation with Hippocratic AI's documentation layer and a separate Epic-native PA module — is the clearest positive-economics case in the programme. Mount Sinai's revenue cycle team has quantified the cost of prior authorisation denial management at approximately $14.6 million annually across the enterprise, including clinical staff time, appeals processing, and delayed procedure revenue. A capability that improves first-submission approval rates by ten percentage points — a conservative target based on published data from comparable health systems — would generate a direct return that exceeds the three-year total cost of the programme's current vendor contracts. Moreno does not discuss these numbers publicly, but three people with knowledge of the internal financial model confirmed the magnitude.
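
The shape of that return calculation can be sketched as below, under loudly stated assumptions. The $14.6 million figure comes from the text. The baseline first-submission denial rate and the three-year contract cost are not given in the source, so both are left as inputs; the model also assumes denial management cost scales linearly with the denial rate, which is a simplification and not something the internal financial model is known to use.

```python
ANNUAL_DENIAL_COST = 14_600_000  # USD per year, figure quoted in the text


def annual_saving(baseline_denial_rate: float,
                  improvement_points: float = 0.10,
                  annual_cost: float = ANNUAL_DENIAL_COST) -> float:
    """Estimated yearly saving if cost scales linearly with the denial rate.

    baseline_denial_rate is a hypothetical input; the source does not state it.
    improvement_points is the ten-percentage-point target, expressed as 0.10.
    """
    if not 0 < baseline_denial_rate <= 1:
        raise ValueError("baseline_denial_rate must be in (0, 1]")
    # The denial rate cannot fall below zero, so cap the reduction.
    reduction = min(improvement_points, baseline_denial_rate)
    return annual_cost * reduction / baseline_denial_rate


def exceeds_contract_cost(saving_per_year: float,
                          contract_cost_3y: float,
                          years: int = 3) -> bool:
    """Compare cumulative savings with a (hypothetical) three-year contract total."""
    return saving_per_year * years > contract_cost_3y


if __name__ == "__main__":
    # Illustrative only: a 25% baseline denial rate is an invented input.
    saving = annual_saving(baseline_denial_rate=0.25)
    print(round(saving))
```

The result is highly sensitive to the baseline: the lower the starting denial rate, the larger the share of denial cost a ten-point improvement removes under this linear assumption.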

The more structurally significant payer question involves documentation coding. The diagnostic agent programme generates structured clinical documentation at a higher specificity level than unassisted attending notes. Higher-specificity documentation supports higher-acuity diagnosis coding, which affects risk-adjustment revenue under value-based contracts and case-mix index reporting for CMS reimbursement. The possibility that AI-assisted documentation systematically shifts coding patterns — without any change in actual clinical acuity — is a compliance exposure that Mount Sinai's revenue integrity team has flagged formally. The CAGS framework does not currently address this. Moreno's team has convened a working group with the compliance and revenue cycle functions to develop a documentation coding audit protocol. The group is expected to produce a draft standard by June 2024.
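
One plausible starting point for such an audit is a comparison of average coded case weight between AI-assisted and unassisted encounters. The sketch below is an assumption about what a first-pass check could look like, not the working group's protocol: the function names, the use of a per-encounter weight, and the tolerance are all hypothetical.

```python
from statistics import mean
from typing import Sequence


def mean_weight_shift(assisted: Sequence[float],
                      unassisted: Sequence[float]) -> float:
    """Difference in mean coded case weight: assisted minus unassisted."""
    if not assisted or not unassisted:
        raise ValueError("both groups need at least one encounter")
    return mean(assisted) - mean(unassisted)


def flag_for_review(shift: float, tolerance: float) -> bool:
    """Flag when the absolute shift exceeds a tolerance chosen by the auditors."""
    return abs(shift) > tolerance


if __name__ == "__main__":
    # Made-up weights, for illustration only.
    shift = mean_weight_shift([1.2, 1.4], [1.0, 1.2])
    print(flag_for_review(shift, tolerance=0.1))
```

A raw difference like this cannot separate a documentation effect from a genuine difference in patient acuity between the two groups, so a real protocol would presumably need matched cohorts or chart review on top of it.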

What to watch

The programme is 18 months into governance build and eight months into live deployment across its first three capabilities. The next decision window is Q3 2024, when the Hippocratic AI post-discharge navigation evaluation concludes and the Icahn research group delivers its first peer-reviewed manuscript on the ED triage agent study.

Whether the Hippocratic AI post-discharge navigation evaluation produces a safety signal that alters the production decision timeline; patient acceptance of an AI-initiated phone call in the 48 hours after cardiac hospitalisation is not a settled question, and any adverse outcome attributable to the capability — a missed escalation, a misunderstood medication instruction — would reset the governance review from a deployment decision to a programme-level reassessment.
Whether the New York State Department of Health's anticipated 2024 AI guidance update — expected to address generative AI in clinical settings specifically, rather than decision support tools broadly — introduces new consent or disclosure requirements that the CAGS framework does not currently satisfy; the NYSDOH Office of Health Systems Management has been active in this space, and Mount Sinai's exposure as the most visible health system in New York City makes it a natural regulatory reference point.
Whether Epic's accelerating clinical AI roadmap forces a vendor rationalisation decision inside Mount Sinai's programme; Abridge's position is most immediately at risk, since Epic Cheers covers the ambient documentation capability in the same workflow layer, and a health system running Epic has a native integration argument that third-party vendors must counter with demonstrable performance differentiation rather than feature parity.
Whether the Icahn School research partnership produces a published evaluation methodology that the broader clinical informatics community adopts as a reference standard; the programme's value to Mount Sinai as a reputational asset depends partly on whether its evidence base achieves external credibility, and a well-received publication in JAMIA or the New England Journal of Medicine would accelerate that outcome more than any press release.
Whether the revenue integrity working group's documentation coding audit protocol surfaces a material compliance finding; if AI-assisted documentation is systematically shifting coding patterns across the enterprise, the institution will face a choice between modifying the capability's output constraints — which reduces its clinical documentation value — or accepting an ongoing compliance monitoring burden that the CAGS framework was not originally designed to carry.

Frequently asked

What is the Clinical Agent Governance Standard, and how does it differ from evaluation frameworks at other major health systems?
CAGS is Mount Sinai's internally developed governance framework for clinical AI capabilities. It defines four capability tiers based on the degree of clinical content generation and the proximity of agent output to consequential clinical decisions. Unlike Mayo Clinic's CARS framework, which applies institution-wide fixed accuracy thresholds, CAGS integrates New York State regulatory compliance requirements directly into its tier definitions, including seven-year audit log retention and NYSDOH-compliant documentation standards. The framework was developed before any agents were deployed and reviewed by medical staff leadership before the first production capability launched.

Why did Mount Sinai select Abridge over Epic's native ambient documentation product?
The vendor evaluation, completed in mid-2023, preceded Epic's Cheers product reaching feature maturity on the capabilities Mount Sinai prioritised. Abridge had a larger published evidence base across comparable academic medical centres, and Nair's team at Icahn required a vendor willing to participate in a structured research study with independent note quality assessment, a condition Abridge accepted and Epic's commercial team declined. The evaluation was not an anti-Epic decision; it was a timing and research-participation decision. The question of whether that decision holds through the Epic Cheers product cycle is live inside Moreno's team and not resolved.

What is Hippocratic AI's product, and why is Mount Sinai evaluating it in cardiology rather than a lower-acuity population?
Hippocratic AI builds AI agents for patient communication: post-discharge navigation, medication adherence support, and care coordination. Mount Sinai chose cardiology for the initial evaluation because the post-discharge readmission risk in cardiac patients is well-characterised, the 30-day readmission rate is a publicly reported quality metric with financial consequences under CMS payment policy, and the care coordination team had an existing outreach workflow that the agent capability could augment rather than replace. A lower-acuity population would produce a smaller signal on the metrics the evaluation is designed to measure.

How does the New York City operating environment create constraints that health systems in other states do not face?
Three factors are specific to the New York context. The NYSDOH documentation and audit requirements are more prescriptive than those in most other states, adding infrastructure cost and constraining how quickly capabilities can be deployed without a full compliance review. The patient population's linguistic diversity (Mount Sinai serves patients across more than 90 primary languages at its upper Manhattan campuses) requires any patient-facing capability to meet a language access standard that is both a regulatory requirement under state civil rights law and a clinical safety consideration. And the institution's visibility means regulatory guidance that is drafted as general policy is frequently tested first in enforcement conversations with New York City's major health systems before it reaches the rest of the state.

What does the programme's financial case actually rest on?
The strongest near-term financial case is prior authorisation automation, where the denial management cost is quantifiable and the improvement target is grounded in published benchmarks. The documentation coding risk, the possibility that AI-assisted notes shift coding patterns without reflecting actual clinical acuity changes, is an offset that the institution is actively trying to measure and, if necessary, design around. The value-based contract economics, where reduced readmissions generate shared savings under the MSSP arrangement, are directionally positive but will take two to three annual settlement cycles to produce a clear financial signal. The programme's internal sponsors are not claiming short-term ROI. They are claiming that the governance build creates an institutional capability that will generate compounding returns as Tier 2 and eventually Tier 3 capabilities become operationally and regulatorily viable.

The Mount Sinai programme is the most carefully constructed clinical AI deployment currently operating in a major American urban health system. The CAGS framework, the Icahn research partnership, the seven-year audit infrastructure, the regulatory anticipation — none of it is accidental. Moreno built a programme that is designed to be legible to regulators, defensible in litigation, and publishable in peer-reviewed journals. Those three requirements, taken together, produce a deployment that is slower than its peer institutions and more expensive than its commercial case requires. That is the point. In New York City, where the margin for a high-profile failure is essentially zero, the cost of legibility is the cost of operating at all.

The question the programme will answer over the next 18 months is whether the research partnership's evidence base develops fast enough to validate the governance investment before the competitive and commercial pressures of the Epic product cycle, the Hippocratic AI evaluation outcome, and the NYSDOH regulatory update force a set of decisions that the current framework did not anticipate. Moreno's team built for precision. They are about to find out whether precision is fast enough.