Can AI Outperform Physicians in Clinical Reasoning?

In a research letter published in JAMA Internal Medicine, physician-scientists from Beth Israel Deaconess Medical Center (BIDMC) compared the reasoning capabilities of a large language model (LLM) against established benchmarks used to evaluate physicians' performance.

Given the multifaceted nature of diagnostic processes, the study aimed to assess whether LLMs could match physicians' proficiency in clinical reasoning.

The researchers employed the revised-IDEA (r-IDEA) score, a validated tool for evaluating clinical reasoning among physicians.

Twenty-one attending physicians and eighteen residents participated in the study, each tasked with analyzing a subset of twenty clinical cases structured into four sequential stages of diagnostic reasoning. Participants were instructed to articulate and justify their differential diagnoses at each stage. Likewise, ChatGPT-4 received identical prompts and processed all twenty clinical cases. The responses were evaluated based on clinical reasoning (r-IDEA score) and various other reasoning metrics.

Unexpectedly, ChatGPT-4 achieved the highest r-IDEA scores, with a median score of 10 out of 10, compared to 9 for attending physicians and 8 for residents. Diagnostic accuracy (how high the correct diagnosis was placed within the provided list) and correct clinical reasoning were comparable between humans and the AI. However, the researchers observed more instances of incorrect reasoning in the AI-generated responses than in the physicians' responses. This outcome emphasizes the potential role of AI as a supplementary tool that enhances rather than replaces human reasoning.

It is anticipated that AI will improve the patient-physician interaction by reducing current inefficiencies, allowing physicians to engage more fully with patients during consultations.