ChatGPT Health performance in a structured test of triage recommendations
Key Points:
- ChatGPT Health, launched in January 2026, was evaluated on 60 clinician-authored vignettes spanning 21 clinical domains. Each vignette was presented under varied conditions, yielding 960 triage responses (16 per vignette).
- Performance followed an inverted U-shaped pattern: the most dangerous triage errors were concentrated at the clinical extremes, in non-urgent cases (35% failure) and emergency conditions (48% failure).
- It under-triaged 52% of gold-standard emergency cases, such as diabetic ketoacidosis and impending respiratory failure, often recommending delayed evaluation instead of immediate emergency care.
- When family or friends minimized symptoms, anchoring bias significantly shifted triage recommendations toward less urgent care in edge cases.
- Crisis intervention messages triggered