OpenAI’s o1-preview AI system outperforms human doctors in diagnosing tricky medical cases, study finds
A new study suggests OpenAI’s o1-preview AI system might be better at diagnosing tricky medical cases than human doctors.
A team of researchers from Harvard Medical School and Stanford University put o1-preview through a comprehensive series of medical diagnosis tests. Their findings show the AI system has made remarkable strides compared to previous versions.
According to the study, o1-preview correctly diagnosed 78.3% of all cases it examined. In a direct comparison of 70 specific cases, the system performed even better, accurately diagnosing 88.6% of cases – significantly outperforming its predecessor GPT-4, which managed 72.9%.
When it comes to medical reasoning, o1-preview’s performance was even more striking. Using the R-IDEA scale, a standard measure for evaluating medical reasoning quality, the AI system achieved perfect scores in 78 out of 80 cases. To put that in perspective, experienced doctors reached perfect scores in only 28 cases, while medical residents managed it in just 16 cases.
The researchers acknowledge that some test cases might have been included in o1-preview’s training data. However, when they tested the system on newer cases it had never encountered, its performance only dropped slightly.
One of the study authors, Dr. Adam Rodman, highlighted the results on X, with a caveat: “This is the first time I have promoted one of our preprints (rather than the full peer-reviewed study) so caveat emptor. But I truly think our results have implications for medical practice so I wanted to get them out as quickly as possible.”
Better at Complex Cases Than Human Doctors
The AI system stood out most on complex management cases that 25 specialists had specifically designed to be difficult. “Humans appropriately struggled. But o1 – you don’t need statistics to see how well it performed,” Rodman explains.
In these tough cases, o1-preview scored 86% of possible points. That’s more than double what doctors achieved using GPT-4 (41%) or traditional tools (34%).
The system isn’t perfect, though. It struggled with probability assessments, showing no real improvement over older models. For example, when estimating the likelihood of pneumonia, o1-preview suggested 70%, well above the scientific reference range of 25-42%.
The researchers found a pattern: while the system excels at tasks requiring critical thinking, like making diagnoses and recommending treatments, it has trouble with more abstract challenges like estimating probabilities.
They also point out that o1-preview tends to give detailed answers, which might have boosted its scores. Plus, the study only looked at o1-preview working alone – not how well it might work alongside human doctors.
Some critics argue that o1-preview’s suggested diagnostic tests are often too expensive and impractical for real-world use.
Since the study, OpenAI has released the full o1 version and its successor o3, which show significantly improved performance on complex reasoning tasks – far surpassing o1-preview’s capabilities in benchmarks that require deep analytical thinking.
Still, even these more powerful models don’t address the core concerns critics have raised about practical implementation and cost. Having a more capable AI system doesn’t automatically solve the challenge of making it work in real-world healthcare settings.
How to test medical AI
Rodman cautions against overhyping the results: “This is a benchmarking study. While these are ‘gold standard’ evaluations of reasoning that we use for human clinicians, these are obviously not actual medical care. Do not get rid of your doctor in favor of o1.”
The researchers say we need better ways to evaluate medical AI systems. Multiple-choice tests don’t capture the complexity of real medical decision-making.
They’re calling for new, more practical testing methods, real-world clinical trials, better technical infrastructure, and improved ways for humans and AI to work together.