AI Just Beat Doctors at Diagnosing ER Patients. Don't Get Too Excited
Key Points:
- Researchers at Harvard and Beth Israel Deaconess Medical Center tested OpenAI’s reasoning large language model (LLM), o1-preview, on diagnosing emergency room patients; it achieved 67.1% accuracy, outperforming two expert physicians, who scored 55.3% and 50.0%.
- On complex clinical vignettes, o1-preview included the correct diagnosis in 78.3% of cases and suggested helpful diagnoses in 97.9%, surpassing both GPT-4 and human physician baselines.
- The study’s authors emphasize that AI is not a replacement for doctors but a collaborative tool, one that requires rigorous testing to ensure it improves patient outcomes while clinicians retain oversight and accountability.
- The reasoning model still struggles with multimodal inputs such as medical images and audio, a key area for future research into improving diagnostic capabilities.
- Experts caution that AI limitations such as hallucinations and potential manipulation make AI safety essential, urging a “trust, but verify” approach when integrating AI into clinical practice.