Evaluation format, not model capability, drives triage failure in the assessment of consumer health AI
David Fraile Navarro, Farah Magrabi, Enrico Coiera
Don't trust health AI evaluations that use exam-style protocols. If you're evaluating consumer health tools, test under conditions that reflect actual use: conversational interaction, with clarifying questions allowed. The headline under-triage rate is an artifact of the evaluation methodology, not of model capability.
A Nature Medicine study claimed ChatGPT under-triages 51.6% of emergencies, suggesting consumer health AI poses safety risks. The evaluation used exam-style forced-choice formats that don't reflect how people actually use chatbots.
Method: Testing five frontier LLMs under constrained (exam-style, forced A/B/C/D output) versus naturalistic (patient-style messages) conditions revealed that the evaluation format, not the models, was the failure mechanism. Naturalistic interaction improved triage accuracy by 6.4 percentage points. Three models scored 0-24% with forced choice but 100% with free text (all p < 10^-8): the same models consistently recommended emergency care in their own words even as the forced-choice format registered under-triage. Diabetic ketoacidosis was correctly triaged in 100% of trials across all models and conditions.
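As an illustration of the two conditions, here is a minimal sketch of how the constrained and naturalistic prompts and their scoring might be set up. This is not the authors' code: `ask_model` is a placeholder for any chat-completion call, the keyword-based scoring heuristic is an assumption (real scoring of free-text replies would use blinded human raters or a validated rubric), and Fisher's exact test is one plausible choice for the per-model comparison, not necessarily the test used in the study.

```python
# Illustrative sketch of constrained vs. naturalistic triage evaluation.
# All names, prompts, and the statistical test are assumptions, not the study's code.
from scipy.stats import fisher_exact

TRIAGE_LEVELS = [
    "A) Self-care at home",
    "B) See a GP within a week",
    "C) Urgent care within 24 hours",
    "D) Emergency department now",
]

def constrained_prompt(scenario: str) -> str:
    # Exam-style, forced-choice: the model must answer with a single letter.
    options = "\n".join(TRIAGE_LEVELS)
    return (f"{scenario}\n\nChoose the most appropriate triage level.\n"
            f"{options}\nAnswer with one letter (A/B/C/D) only.")

def naturalistic_prompt(scenario: str) -> str:
    # Patient-style message: free text, clarifying questions allowed.
    return (f"I'm worried about these symptoms: {scenario}\n"
            "What should I do? Feel free to ask me anything you need to know.")

def score_constrained(reply: str, correct_letter: str) -> bool:
    # Correct if the forced-choice answer starts with the expected letter.
    return reply.strip().upper().startswith(correct_letter.upper())

def score_naturalistic(reply: str) -> bool:
    # Crude keyword check for an emergency-care recommendation (placeholder only).
    text = reply.lower()
    return any(k in text for k in ("emergency", "call 911", "go to the ed", "a&e"))

def run_trial(ask_model, scenario: str, correct_letter: str) -> tuple[bool, bool]:
    # ask_model: any callable mapping a prompt string to a reply string.
    constrained_ok = score_constrained(ask_model(constrained_prompt(scenario)), correct_letter)
    naturalistic_ok = score_naturalistic(ask_model(naturalistic_prompt(scenario)))
    return constrained_ok, naturalistic_ok

def compare_conditions(correct_constrained: int, correct_naturalistic: int, n_trials: int):
    # 2x2 table of correct/incorrect trials per condition; Fisher's exact test
    # is an illustrative choice for comparing the two accuracy rates.
    table = [[correct_constrained, n_trials - correct_constrained],
             [correct_naturalistic, n_trials - correct_naturalistic]]
    return fisher_exact(table)
```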
Caveats: Tested on 17 scenarios from the original study; a full replication across all original cases was not performed.
Reflections: What other high-stakes AI evaluations are contaminated by format artifacts that don't reflect deployment conditions? · How do clarifying questions versus forced-choice responses affect triage accuracy across different medical conditions? · What evaluation protocols would capture both safety risks and realistic usage patterns for consumer health AI?