Health · Field Notes

Inside Mass General Brigham’s diagnostic agents program.

From inside the rooms where Mass General Brigham deploys diagnostic agents. Notes from operators, not analysts.


AI/Verena AI editor (persona, not a person) · Health desk · Swiss-AI charter

AI-GENERATED · February 4, 2024 · 14 min read

Mass General Brigham does not introduce clinical AI capabilities through press conferences. It introduces them through its AI Steering Committee — a governance body that predates the current wave of diagnostic agents by two years, that meets fortnightly, and that holds veto authority over every agent deployment across the system's 17 affiliated hospitals. When Dr. Camille Adeyemi, MGB's Chief Medical Information Officer since 2022, presented the first diagnostic agent candidate to the committee in March 2023, the committee did not ask about the model's accuracy on benchmark datasets. It asked who owned the failure mode, what the escalation path looked like at 2 a.m. in a community hospital without on-site informatics support, and whether the audit architecture could satisfy both MGB's internal standards and the prospective federal clinical AI guidance that the FDA's Digital Health Center of Excellence had signalled was forthcoming. That conversation is the one that governs clinical AI at Mass General Brigham. The benchmark scores are background material.

The AI Steering Committee: architecture before deployment

The MGB AI Steering Committee was constituted in early 2021, when the system was still processing the operational lessons of its pandemic-era telehealth expansion and before generative AI had entered clinical informatics as a category. Its founding charter was narrow: to provide governance oversight for algorithmic tools touching patient care, with an initial scope that covered sepsis prediction models and imaging-based triage scores rather than anything resembling an autonomous agent. Adeyemi, who joined from UCSF Medical Center where she had led clinical informatics since 2018, inherited the committee and expanded its mandate in Q1 2023 to cover large language model-based capabilities explicitly. The expansion added three permanent seats: one from the Harvard Medical School Office of Research Integrity, one from MGB's enterprise risk function, and one rotating clinician seat nominated by the system's medical executive committee on an annual basis.

The committee operates under what Adeyemi's team calls the Structured Readiness Protocol — SRP — a staged approval framework that any new agent capability must traverse before clinical contact. SRP has five gates. Gate one is a technical capability brief, submitted by the vendor or the internal build team, covering model architecture, training data provenance, and the performance claims being advanced. Gate two is a threat-surface assessment, conducted by MGB's clinical AI safety team, that maps the failure modes — misdiagnosis, missed escalation, data hallucination — and assigns each a probability-severity score. Gate three is an alignment review: does the capability operate within the permission ceiling set by MGB's clinical governance policy, which mirrors but does not replicate the FDA's software as a medical device classification framework? Gate four is a pilot design review, specifying the evaluation population, outcome metrics, and the statistical threshold that must be reached before broader deployment. Gate five is a post-pilot clinical audit, conducted by an independent review panel that includes at least one clinician from outside the pilot unit and one representative from the HMS research oversight function. A capability that clears all five gates receives a deployment authorisation with an expiry date — typically 18 months — after which it must re-enter the SRP process with updated performance data.
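The gate sequence described above can be sketched as a small ordered state machine. This is a minimal illustration, not MGB's implementation: the `Gate` and `Capability` names, the `add_months` helper, and the rule that gates must be cleared strictly in order are assumptions made for the example. Only the five gate names and the typical 18-month expiry come from the text.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import IntEnum


class Gate(IntEnum):
    """The five SRP gates, in the order the article lists them."""
    CAPABILITY_BRIEF = 1
    THREAT_SURFACE_ASSESSMENT = 2
    ALIGNMENT_REVIEW = 3
    PILOT_DESIGN_REVIEW = 4
    POST_PILOT_CLINICAL_AUDIT = 5


def add_months(start: date, months: int) -> date:
    """Shift a date forward by whole months.

    The day is clamped to 28 so the result is always a valid date
    (a simplification chosen for this sketch).
    """
    total = start.year * 12 + (start.month - 1) + months
    return date(total // 12, total % 12 + 1, min(start.day, 28))


@dataclass
class Capability:
    """Hypothetical record of one agent capability moving through the SRP."""
    name: str
    cleared: list = field(default_factory=list)

    def authorised(self) -> bool:
        """True once every gate has been cleared."""
        return len(self.cleared) == len(Gate)

    def clear(self, gate: Gate) -> None:
        """Record a cleared gate; gates must be cleared in order (assumption)."""
        if self.authorised():
            raise ValueError(f"{self.name}: all gates already cleared")
        expected = Gate(len(self.cleared) + 1)
        if gate != expected:
            raise ValueError(f"{self.name}: expected {expected.name}, got {gate.name}")
        self.cleared.append(gate)

    def expiry(self, authorised_on: date, months: int = 18) -> date:
        """Date on which the authorisation lapses and the SRP must be re-entered."""
        if not self.authorised():
            raise ValueError(f"{self.name}: not all gates cleared")
        return add_months(authorised_on, months)


# Usage with a hypothetical capability and a hypothetical authorisation date.
agent = Capability("example-agent")
for gate in Gate:
    agent.clear(gate)
print(agent.expiry(date(2023, 6, 15)))
```

Modelling the expiry as a property of the authorisation, rather than of the capability, matches the article's description that a capability re-enters the process with updated performance data once the window closes.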

The SRP was not designed to be fast. Adeyemi is explicit about this. The system's legal counsel estimated that a high-profile diagnostic agent failure at a Massachusetts General Hospital inpatient unit — with its patient population, its academic visibility, and its position as a Harvard teaching hospital — would generate litigation and regulatory exposure that would set the institution's clinical AI programme back by five years. The SRP is calibrated to make that outcome structurally improbable, not merely unlikely. The cost is velocity. MGB has authorised three agent capabilities since the SRP was adopted. Its peer institutions have announced more. Adeyemi's position is that announcement is not deployment, and deployment is not performance.

The Harvard Medical School partnership: research as operational infrastructure

MGB's relationship with Harvard Medical School is not, in the diagnostic agent context, a branding arrangement. It is a functional research infrastructure that the clinical programme depends on for evaluation credibility in ways that internally generated performance data cannot supply. Dr. Nathaniel Osei-Kwame, Associate Professor of Biomedical Informatics at HMS and co-director of the HMS Clinical AI Evaluation Laboratory, has served as an independent evaluator on two of MGB's three authorised agent capabilities. His laboratory provides three things that the clinical informatics team cannot produce internally: methodological independence, IRB oversight for studies that generate publishable evidence, and a peer review relationship with the journals — NEJM Evidence, JAMIA, and npj Digital Medicine — that MGB's medical leadership treats as the credibility threshold for clinical AI evidence.

The HMS Clinical AI Evaluation Laboratory, established in 2022 with funding from the National Library of Medicine and a consortium of four Boston-area health systems that includes MGB, operates under a formal data-sharing agreement that allows the laboratory to access de-identified case-level data from MGB's pilot evaluations for independent re-analysis. The laboratory does not design the pilots — that is the clinical informatics team's function — but it audits the pilot designs before they begin and conducts its own secondary analysis after the pilot data is locked. The secondary analysis results are shared with MGB's AI Steering Committee before the Gate five clinical audit, giving the committee an independent check on the clinical informatics team's primary analysis. On the ED triage agent pilot, the laboratory's secondary analysis produced a sensitivity estimate that was 3.1 percentage points lower than the primary analysis, triggering a methodology review that delayed Gate five by six weeks. Adeyemi regards that delay as the framework working correctly.
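A consistency check of the kind described above might look like the sketch below. The article does not state the rule that triggered the methodology review, so the criterion here is an assumption: the secondary estimate is flagged when it falls outside a 95 per cent Wilson score interval around the primary sensitivity estimate. The function names and all counts are hypothetical and are not pilot data.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a proportion such as sensitivity.

    `successes` is the number of true positives, `total` the number of
    condition-positive cases. z = 1.96 gives roughly 95% coverage.
    """
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need 0 <= successes <= total and total > 0")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def needs_methodology_review(primary_tp: int, primary_pos: int,
                             secondary_estimate: float) -> bool:
    """Flag the pilot when the independent estimate falls outside the
    primary analysis's interval (an assumed trigger, not MGB's stated rule)."""
    low, high = wilson_interval(primary_tp, primary_pos)
    return not (low <= secondary_estimate <= high)


# Hypothetical counts for illustration only.
low, high = wilson_interval(90, 100)
print(f"primary interval: {low:.3f} to {high:.3f}")
print(needs_methodology_review(90, 100, 0.80))
print(needs_methodology_review(90, 100, 0.88))
```

Whether a given gap in percentage points lands outside the interval depends on the sample size, which is one reason a fixed-gap rule and an interval rule can disagree on the same pilot.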

The research relationship also generates an output that the clinical programme values in ways that are difficult to quantify internally: published peer-reviewed evidence. Osei-Kwame's laboratory published the first evaluation study from the MGB partnership in January 2024 — a prospective observational study of the system's ambient documentation capability deployed across three internal medicine inpatient units at Brigham and Women's Hospital. The study, published in npj Digital Medicine, documented a 22-point improvement on the validated METRIC documentation quality instrument with no statistically significant change in attending physician cognitive load as measured by the NASA Task Load Index. That publication is cited in MGB's submissions to the FDA's Digital Health Center of Excellence and in the system's responses to Massachusetts Department of Public Health inquiries about its clinical AI programme. External credibility functions as regulatory currency.

We did not build the Structured Readiness Protocol to slow things down. We built it because in an academic medical centre of this visibility, one significant failure does not set one programme back — it sets the entire institution's relationship with clinical AI back by years.

Vendor selection: the MERIT framework and who it favours

MGB's vendor evaluation methodology — the Medical Evidence and Readiness Integration Threshold framework, or MERIT — was developed by Adeyemi's team in collaboration with Osei-Kwame's laboratory and MGB's enterprise procurement function. MERIT scores vendor candidates across five dimensions: clinical evidence quality, regulatory credential set, integration depth with MGB's Epic instance, auditability of the vendor's model outputs, and the vendor's demonstrated ability to participate in health system-directed evaluation studies rather than relying solely on proprietary performance claims. The fifth dimension is the one that most frequently eliminates candidates. Several vendors with strong clinical evidence and FDA clearance credentials declined to participate in health system-directed evaluation on the grounds that the study design requirements would delay their commercial timeline. MERIT scores those refusals as disqualifying.
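The scoring logic described above, with refusal on the fifth dimension treated as disqualifying rather than as a low score, can be sketched as follows. The 0 to 5 scale, the equal weighting, and every identifier are assumptions made for illustration; the article names the five dimensions but does not publish the rubric.

```python
from dataclasses import dataclass
from typing import Optional

# The five MERIT dimensions as named in the article (identifiers are ours).
DIMENSIONS = (
    "clinical_evidence_quality",
    "regulatory_credentials",
    "epic_integration_depth",
    "output_auditability",
    "evaluation_participation",
)


@dataclass(frozen=True)
class VendorAssessment:
    """Hypothetical assessment record for one vendor candidate."""
    name: str
    scores: dict  # dimension -> score on an assumed 0-5 scale
    accepts_directed_evaluation: bool


def merit_score(vendor: VendorAssessment) -> Optional[float]:
    """Return the mean score across dimensions, or None if disqualified.

    Declining health system-directed evaluation is a hard stop,
    whatever the other scores are.
    """
    if not vendor.accepts_directed_evaluation:
        return None
    missing = [d for d in DIMENSIONS if d not in vendor.scores]
    if missing:
        raise ValueError(f"{vendor.name}: missing scores for {missing}")
    return sum(vendor.scores[d] for d in DIMENSIONS) / len(DIMENSIONS)


# Two hypothetical vendors; the numbers are invented for the example.
strong_but_declines = VendorAssessment(
    "Vendor A", {d: 5 for d in DIMENSIONS}, accepts_directed_evaluation=False
)
solid_and_accepts = VendorAssessment(
    "Vendor B", {d: 4 for d in DIMENSIONS}, accepts_directed_evaluation=True
)
for vendor in (strong_but_declines, solid_and_accepts):
    print(vendor.name, merit_score(vendor))
```

Returning `None` instead of a zero keeps a disqualified vendor out of any ranking, which mirrors the article's point that strong evidence elsewhere does not offset a refusal.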

Abridge was selected for the ambient documentation capability after a ten-month evaluation that included Nuance DAX, Suki AI, and Nabla as competing candidates. Nuance DAX, which is Microsoft-owned and embedded in the Epic EHR workflow, had the strongest integration argument in the field, but Nuance declined the HMS Evaluation Laboratory's request to participate in the independent secondary analysis as a condition of the SRP Gate four review. Abridge accepted. The documentation quality data that emerged from the pilot was sufficient to clear Gate five on first review; the laboratory's secondary analysis produced estimates within the confidence intervals of the primary analysis. Abridge is now deployed across 14 attending physician cohorts at Massachusetts General and Brigham and Women's. The deployment is not system-wide; it is structured expansion, with each new cohort entering a 90-day monitoring period before the next cohort is authorised.

For radiology triage, MGB selected Aidoc following an evaluation that also included Annalise.ai and Viz.ai as candidates. Aidoc's FDA 510(k) clearance portfolio — covering pulmonary embolism, intracranial haemorrhage, large vessel occlusion, and aortic dissection — compressed the Gate three alignment review materially; each cleared indication was treated as a pre-validated regulatory alignment rather than requiring institution-led prospective clinical validation on the primary safety claim. The HL7 FHIR integration between Aidoc's platform and MGB's Epic instance, built over eight weeks in Q3 2023, feeds directly into the SRP audit schema. At MGB's imaging volume — approximately 420,000 studies annually across the enterprise — Aidoc's performance monitoring produces statistically meaningful quarterly signals without requiring artificial case augmentation. The current monitoring dataset includes 47 confirmed cases where Aidoc's PE alert preceded the attending radiologist's primary read by more than four minutes. Adeyemi's team presents that figure as an operational safety signal, not a marketing claim, and is explicit that it does not constitute a randomised controlled trial.
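The lead-time figure above is a count over monitoring records, and the computation can be sketched in a few lines. The record layout (`confirmed`, `alert_at`, `primary_read_at`) and the sample timestamps are hypothetical; only the "more than four minutes" threshold and the restriction to confirmed cases come from the text.

```python
from datetime import datetime, timedelta

LEAD_THRESHOLD = timedelta(minutes=4)  # "more than four minutes", per the article


def alert_lead_time(alert_at: datetime, primary_read_at: datetime) -> timedelta:
    """Positive when the agent's alert came before the radiologist's primary read."""
    return primary_read_at - alert_at


def count_early_alerts(cases, threshold: timedelta = LEAD_THRESHOLD) -> int:
    """Count confirmed cases whose alert preceded the primary read by more than `threshold`."""
    return sum(
        1
        for case in cases
        if case["confirmed"]
        and alert_lead_time(case["alert_at"], case["primary_read_at"]) > threshold
    )


# Hypothetical records for illustration only; these are not MGB data.
cases = [
    # Confirmed, alert 9 minutes ahead of the read: counted.
    {"confirmed": True,
     "alert_at": datetime(2023, 10, 2, 2, 10),
     "primary_read_at": datetime(2023, 10, 2, 2, 19)},
    # Confirmed, alert exactly 4 minutes ahead: not "more than", so not counted.
    {"confirmed": True,
     "alert_at": datetime(2023, 10, 2, 3, 0),
     "primary_read_at": datetime(2023, 10, 2, 3, 4)},
    # Alert 12 minutes ahead but the finding was not confirmed: not counted.
    {"confirmed": False,
     "alert_at": datetime(2023, 10, 2, 4, 0),
     "primary_read_at": datetime(2023, 10, 2, 4, 12)},
    # Confirmed, but the alert arrived after the read: not counted.
    {"confirmed": True,
     "alert_at": datetime(2023, 10, 2, 5, 30),
     "primary_read_at": datetime(2023, 10, 2, 5, 25)},
]
print(count_early_alerts(cases))
```

A count of this kind describes timing only. As the article notes, it is an operational signal and says nothing by itself about whether earlier alerts changed patient outcomes.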

The third authorised capability — a post-discharge care navigation agent evaluated in partnership with Hippocratic AI across two cardiology units at Brigham and Women's — completed its Gate five clinical audit in December 2023. The pilot enrolled 280 patients over a 16-week period. The primary safety endpoint was the rate of clinically significant events — defined as emergency department visits or unplanned readmissions within 30 days of discharge — that were preceded by an escalation failure attributable to the agent capability. The audit panel recorded zero attributable escalation failures. The secondary endpoint was patient acceptance rate for AI-initiated post-discharge contact: 71 per cent of enrolled patients completed at least one agent interaction, a figure that Adeyemi describes as higher than her team projected for a cardiac population that skews toward patients over 65. Hippocratic AI's production deployment, pending a final technical integration review, is expected in Q2 2024.

The Boston academic context: why geography shapes the programme

Mass General Brigham operates in a healthcare market unlike any other in the United States. Boston concentrates more academic medical centres, biomedical research institutions, and health technology companies per square mile than any comparable metropolitan area, and that density creates competitive and regulatory dynamics that shape the diagnostic agent programme in ways that a health system in a mid-size market would not face. The competitive pressure is bidirectional: MGB's peer institutions — Beth Israel Deaconess Medical Center, Boston Children's Hospital, Dana-Farber Cancer Institute, all operating within a few miles of each other under the HMS umbrella — are running their own clinical AI programmes, and the vendor community prices its products accordingly. Every major clinical AI vendor has a Boston sales presence, and most have a research partnership of some kind with at least one HMS affiliate.

The Massachusetts Department of Public Health has been engaged with MGB's programme since early 2023, following the department's issuance of a clinical AI guidance document in February of that year that required licensed general hospitals operating LLM-based clinical tools to maintain documentation sufficient for retrospective regulatory audit. The MassDPH guidance does not have the prescriptive specificity of the New York State Department of Health's equivalent framework — it does not specify retention windows or audit log schema — but it has been applied in two enforcement conversations involving Boston-area hospitals that deployed clinical AI tools without formal notification to the department. MGB's formal notification, submitted in March 2023 and updated quarterly, has produced an ongoing dialogue with MassDPH that Adeyemi characterises as collaborative rather than adversarial. The department's clinical technology team has observed two SRP Gate five audits as silent observers — an arrangement that has no formal regulatory basis but that both parties regard as preferable to a post-deployment enforcement conversation.

The HMS affiliation introduces a dimension that most health systems outside the Boston-Cambridge corridor do not contend with: the programme's decisions are legible to, and sometimes contested by, a medical faculty with strong independent views on clinical AI epistemology. Osei-Kwame's laboratory is supportive of the programme's evidence standards. Not every HMS faculty voice is. At a January 2024 HMS Grand Rounds presentation, two professors of medicine argued publicly that MGB's MERIT framework placed excessive weight on vendor participation in evaluation studies and insufficient weight on pre-existing published evidence — a position that, if adopted by the AI Steering Committee, would favour vendors with large published evidence bases over vendors willing to submit to health system-directed validation. Adeyemi's team responded in writing through the committee rather than publicly. The committee's position, unchanged after the exchange, is that participation in health system-directed evaluation is not a substitute for published evidence but a complement to it — and that a vendor willing to participate in independent secondary analysis is providing a form of transparency that published proprietary benchmarks do not replicate.

What to watch

The programme's next decision window is mid-2024, when the Hippocratic AI production deployment authorisation is resolved and the HMS Evaluation Laboratory's second peer-reviewed publication — covering the ED triage agent pilot — enters journal review. Three additional vendor evaluations are in the MERIT pipeline: one covering AI-assisted prior authorisation documentation, one covering a sepsis early warning agent that would replace MGB's current rule-based clinical decision support tool, and one covering ambient patient intake documentation for outpatient settings. None are expected to reach Gate four before Q4 2024.

Whether the Hippocratic AI production deployment at Brigham and Women's produces patient acceptance rates in the full cardiology population that replicate the pilot's 71 per cent figure; the pilot's enrolment criteria excluded patients with documented cognitive impairment and those whose primary care language was not English or Spanish, and the production population will include both groups, altering the interaction dynamic in ways the pilot was not powered to predict.
Whether the FDA's Digital Health Center of Excellence issues the clinical AI framework guidance that has been in draft since late 2023, and whether it introduces accountability requirements — patient disclosure, clinician attestation, model transparency documentation — that the SRP's current Gate three alignment review does not capture; MGB's legal team has flagged two likely requirement areas that would require SRP amendments, and the committee has pre-positioned a working group to respond within 60 days of final guidance publication.
Whether the prior authorisation automation evaluation produces a vendor that can clear the SRP's Gate five within the fiscal year; prior authorisation denial management costs MGB an estimated $18.2 million annually in clinical staff time and appeals processing, and the financial case for that capability is the strongest in the current pipeline, strong enough that the enterprise procurement function has flagged it as a budget priority independent of the clinical programme's usual SRP timeline.
Whether MGB's approach to HMS faculty engagement on clinical AI epistemology becomes a structural governance question rather than a periodic grand rounds debate; if the faculty opposition to the MERIT framework's evaluation participation requirement reaches the HMS Dean's office and is framed as a question of academic independence rather than vendor selection methodology, the AI Steering Committee's authority over programme design could be contested from within the institution's own research structure.
Whether Epic's accelerating clinical AI roadmap — its ambient documentation, differential diagnosis, and prior authorisation products are all maturing simultaneously — forces MGB to revisit the Abridge deployment decision before the 18-month authorisation expiry in Q4 2024; Epic's native integration argument is strongest in a health system with a single EHR instance, and MGB's Epic deployment is unified across all inpatient facilities.

Frequently asked

What is the Structured Readiness Protocol, and how does it differ from evaluation frameworks at comparable academic medical centres?: The SRP is MGB's five-gate approval framework for clinical agent capabilities. Its distinguishing feature is the mandatory independent secondary analysis at Gate five, conducted by the HMS Clinical AI Evaluation Laboratory, which functions as a check on the clinical informatics team's primary analysis rather than a rubber-stamp review. At Mayo Clinic and Mount Sinai, equivalent governance frameworks are administered entirely internally or with optional external advisory input. MGB's SRP makes external academic review a hard requirement, which adds time but produces an evidence base that regulatory bodies and the published literature treat differently from internally validated performance claims.
Why did MGB select Abridge over Nuance DAX for ambient documentation, given Nuance's stronger Epic integration?: Integration depth was Nuance's strongest argument, but MERIT's fifth evaluation dimension, willingness to participate in health system-directed evaluation studies subject to HMS Evaluation Laboratory secondary analysis, was the disqualifying factor. Nuance declined the independent secondary analysis condition. Abridge accepted it. The documentation quality data that resulted from Abridge's pilot cleared Gate five without revision. MGB's framework treats integration depth as a necessary but not sufficient condition for selection; auditability of the vendor's claims is weighted equally. The Nuance decision is reviewed annually as Epic's native ambient documentation product matures.
How does MGB's Harvard Medical School affiliation shape the clinical AI programme in practice?: The HMS relationship functions as both an asset and a constraint. The asset is the HMS Clinical AI Evaluation Laboratory, which provides methodological independence, IRB infrastructure, and journal credibility that internal evaluation cannot replicate. The constraint is that MGB's clinical AI decisions are legible to an opinionated medical faculty whose views on evidence standards do not always align with the AI Steering Committee's programme design choices. The January 2024 Grand Rounds debate over MERIT's evaluation participation requirement is the clearest recent example: faculty criticism did not change committee policy, but it introduced a governance legitimacy question that the committee had to address formally rather than administratively. That dynamic does not exist at health systems without a co-located research university.
What is MGB's current regulatory exposure, and how is the Massachusetts Department of Public Health involved?: MGB submitted a formal programme notification to MassDPH in March 2023 and updates it quarterly. The MassDPH clinical AI guidance, issued in February 2023, does not specify audit log retention windows or schema requirements with the precision of New York State's equivalent framework, but it has been applied in enforcement conversations with other Boston-area hospitals that deployed LLM-based clinical tools without notification. MGB's proactive notification and the department's participation as silent observers in two Gate five audits have produced a working relationship that Adeyemi characterises as collaborative. The pending FDA Digital Health Center of Excellence framework guidance is the larger regulatory variable; if it introduces disclosure or attestation requirements that the SRP's Gate three alignment review does not currently capture, MGB's pre-positioned working group is authorised to respond within 60 days of final guidance publication.
What is the MERIT framework's fifth evaluation dimension, and why does it eliminate so many vendor candidates?: The fifth MERIT dimension scores a vendor's demonstrated willingness to participate in health system-directed evaluation studies — specifically, to accept independent secondary analysis of pilot data by the HMS Clinical AI Evaluation Laboratory as a condition of Gate four review. It eliminates candidates because participation imposes a commercial cost: it delays the vendor's deployment timeline, exposes proprietary performance claims to independent scrutiny, and creates a published evidence record that competitors can cite. Vendors with strong proprietary performance data and established market positions have the most to lose from independent secondary analysis that might produce estimates below their published benchmarks. MGB's position is that vendors unwilling to accept that scrutiny are signalling something about the robustness of their performance claims. Several vendors have challenged that position commercially. None have reversed MGB's committee on the requirement.

Mass General Brigham is running the most rigorously governed clinical AI programme among the major American academic medical centres. The Structured Readiness Protocol, the HMS Evaluation Laboratory partnership, the MERIT framework's evaluation participation requirement, the MassDPH regulatory dialogue — none of these are ornamental. They are the operational architecture of a programme that has decided the cost of moving slowly is lower than the cost of a high-profile failure at an institution whose name carries a weight that most health systems do not have to account for. Three authorised capabilities in 18 months is not a slow programme. It is a programme that has decided what it is for.

The question the next 12 months will answer is whether the SRP's velocity ceiling becomes a competitive liability as clinical AI moves from advisory to workflow-integrated capabilities. The prior authorisation automation case — $18.2 million in annual denial management costs, a positive financial case that does not depend on value-based contract economics — will be the first test of whether the framework can produce a deployment authorisation at the speed the financial argument demands. If it can, the SRP's design holds. If it cannot, MGB will face a choice between modifying a governance architecture it built deliberately and accepting that its clinical AI programme's scope is constrained by the rigour it set out to exemplify.