Towards autonomous medical artificial intelligence agents
Key Points:
- The benchmark dataset comprises 574 patients selected from MIMIC-IV, a publicly available EHR database of approximately 300,000 patients treated at Beth Israel Deaconess Medical Center between 2008 and 2019. It covers eight target diagnoses, including abdominal pathologies and internal medicine emergencies.
- Dataset curation applied strict ICD-code-based inclusion criteria and extracted clinical history, laboratory results, microbiology, imaging, medications, and procedures from the first 24 hours of admission. Physicians then manually reviewed the cases and excluded those lacking essential clinical data or imaging.
- A multi-turn conversational AI framework was developed featuring a patient simulation agent grounded in real clinical histories and a physician agent (MIRA) capable of requesting clinical information via FHIR-compliant tools, simulating realistic diagnostic workflows.
- Evaluation protocols assessed the patient agent's consistency and robustness against adversarial prompts, and compared MIRA's diagnostic accuracy with that of human physicians. LLM-based evaluators were combined with manual physician adjudication to limit bias in the reported performance metrics.
- Additional analyses covered medication reconciliation, procedure matching, guideline adherence, safety of triage decisions, and robustness to prompt biases. Statistical methods were applied to validate the findings and to assess MIRA’s clinical decision-making capability under real-world conditions.
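
The ICD-based inclusion check and the 24-hour extraction window described in the curation step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Admission` and `Event` records, their field names, and the two ICD-10 prefixes are assumptions chosen for the example, since the summary does not list the actual codes or the schema used.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Illustrative ICD-10 prefixes only (K35 = acute appendicitis, K80 = cholelithiasis).
# The eight target diagnoses and their codes are not given in this summary.
TARGET_ICD_PREFIXES = ("K35", "K80")


@dataclass
class Admission:
    """Hypothetical admission record; field names are assumptions."""
    hadm_id: int
    admit_time: datetime
    icd_codes: list[str]


@dataclass
class Event:
    """Hypothetical clinical event (lab, imaging, medication, ...)."""
    hadm_id: int
    kind: str
    time: datetime


def meets_inclusion(adm: Admission, prefixes: tuple[str, ...] = TARGET_ICD_PREFIXES) -> bool:
    """Return True if any ICD code on the admission starts with a target prefix."""
    return any(code.startswith(prefixes) for code in adm.icd_codes)


def events_in_window(adm: Admission, events: list[Event], hours: int = 24) -> list[Event]:
    """Keep only this admission's events recorded within `hours` of admission."""
    end = adm.admit_time + timedelta(hours=hours)
    return [
        e for e in events
        if e.hadm_id == adm.hadm_id and adm.admit_time <= e.time <= end
    ]


if __name__ == "__main__":
    adm = Admission(1, datetime(2015, 1, 1, 8, 0), ["K3580", "I10"])
    events = [
        Event(1, "lab", datetime(2015, 1, 1, 9, 0)),      # 1 h after admission: kept
        Event(1, "imaging", datetime(2015, 1, 2, 9, 0)),  # 25 h after admission: dropped
    ]
    if meets_inclusion(adm):
        for e in events_in_window(adm, events):
            print(e.kind, e.time.isoformat())
```

The manual physician review that follows this automated filtering is not modeled here; in this sketch it would be a separate step applied to the admissions that pass `meets_inclusion`.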