Natural language processing

GPT-4o and Co. get it wrong more often than right, says OpenAI study

Summary

A new OpenAI study using its in-house SimpleQA benchmark shows that even the most advanced AI language models fail more often than they succeed when answering factual questions.

The SimpleQA test contains 4,326 questions across science, politics, and art, with each question designed to have one clear correct answer. Two independent reviewers verified answer accuracy.

Pie chart of the thematic distribution of the SimpleQA database across ten subject areas, showing broad coverage that should allow a comprehensive evaluation of AI models. | Image: Wei et al.

OpenAI’s best model, o1-preview, achieved only a 42.7 percent success rate. GPT-4o followed with 38.2 percent correct answers, while the smaller GPT-4o-mini managed just 8.6 percent accuracy.

Anthropic’s Claude models performed even worse. Their top model, Claude-3.5-sonnet, got 28.9 percent right and 36.1 percent wrong. However, smaller Claude models more often declined to answer when uncertain – a desirable response that shows they recognize their knowledge limitations.

Table comparing correctness, error rates, and F-scores for eight AI models on SimpleQA. OpenAI’s o1-preview achieves the highest F-score of 44.8, while smaller models such as GPT-4o-mini perform significantly worse – as expected, since smaller models are trained on less data. | Image: Wei et al.
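The article doesn’t spell out how the F-score is computed; a plausible reading, and the assumption behind the sketch below, is the harmonic mean of overall accuracy and accuracy on attempted questions, which rewards models that abstain rather than guess.

```python
# Sketch of a SimpleQA-style F-score, ASSUMING it is the harmonic mean of
# overall accuracy and accuracy on attempted questions.
def simpleqa_f_score(correct: int, incorrect: int, not_attempted: int) -> float:
    total = correct + incorrect + not_attempted
    attempted = correct + incorrect
    overall = correct / total if total else 0.0              # correct over all questions
    given_attempted = correct / attempted if attempted else 0.0
    if overall + given_attempted == 0:
        return 0.0
    return 2 * overall * given_attempted / (overall + given_attempted)

# Made-up counts: with the same number of correct answers, a model that
# abstains when unsure scores higher than one that guesses on everything.
print(simpleqa_f_score(correct=430, incorrect=200, not_attempted=370))  # ~0.53
print(simpleqa_f_score(correct=430, incorrect=570, not_attempted=0))    # 0.43
```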

Context matters

Note that the test specifically measures knowledge acquired during training. It does not assess the models’ general ability to provide correct answers when given additional context, Internet access, or database connections.

The key takeaway: Users should think of AI models as information processors, not as stand-alone sources of knowledge. For best results, provide them with reliable data rather than relying solely on their built-in knowledge.
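A minimal sketch of that pattern with the OpenAI Python client follows; the model name, prompts, and source text are placeholders, and any retrieval of the source text is left out.

```python
# Minimal sketch: ground the model in supplied text instead of its built-in
# knowledge. Model name, prompts, and source text are placeholders.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

source_text = "(paste a trusted document or database excerpt here)"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "Answer only from the provided source text. "
                    "If the answer is not there, say you don't know."},
        {"role": "user",
         "content": f"Source text:\n{source_text}\n\nQuestion: Who won the award in 1986?"},
    ],
)
print(response.choices[0].message.content)
```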

Table of four sample questions and answers from the SimpleQA database, covering everything from TV shows to music to scientific awards. | Image: Wei et al.

But OpenAI’s findings raise concerns about how these systems are currently used. Many people, especially students, treat them as standalone research and learning tools on the assumption that their answers are accurate most of the time – an assumption these results call into question. The data shows that AI models simply aren’t reliable enough for independent fact-finding or verification.

AI models overestimate themselves

The study also shows that AI language models significantly overestimate their own capabilities when answering questions. When researchers asked the models to rate their confidence in their answers, they consistently gave inflated scores for their own accuracy.
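One way to check this – a rough sketch, not the paper’s exact prompt or parsing – is to ask the model for an answer plus a self-rated confidence score, then bin stated confidence against measured accuracy.

```python
# Rough sketch of a stated-confidence calibration check. The prompt wording,
# model name, and response parsing are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def answer_with_confidence(question: str) -> tuple[str, float]:
    """Ask for an answer and a self-rated confidence between 0 and 100."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"{question}\n\nReply in the form: ANSWER | CONFIDENCE (0-100)",
        }],
    )
    answer, confidence = response.choices[0].message.content.rsplit("|", 1)
    return answer.strip(), float(confidence.strip()) / 100.0

def calibration_bins(results: list[tuple[float, bool]], n_bins: int = 15):
    """Bin (stated_confidence, is_correct) pairs; with perfect calibration the
    average accuracy in each bin matches that bin's confidence level."""
    bins = [[] for _ in range(n_bins)]
    for confidence, is_correct in results:
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append(is_correct)
    return [(i / n_bins, sum(b) / len(b)) for i, b in enumerate(bins) if b]
```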

To measure this overconfidence systematically, the researchers had the models answer the same questions 100 times each. They found that when a model gave the same answer repeatedly, that answer was more likely to be correct – but actual accuracy still stayed below how often the answer was repeated, another sign of overconfidence. This finding fits with the common criticism that language models can produce complete nonsense while sounding certain that it’s right.
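A rough sketch of that procedure – sampling the same question repeatedly, treating the share of the most common answer as a confidence proxy, and checking whether that answer is correct – might look like this; the model name, temperature, and grading rule are assumptions, not the paper’s setup.

```python
# Rough sketch of frequency-based calibration: sample the same question many
# times, use the share of the most common answer as a confidence proxy, and
# compare it with actual correctness. Model, temperature, and grader are
# illustrative assumptions.
from collections import Counter
from openai import OpenAI

client = OpenAI()

def sample_answer(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=1.0,  # sampling, so answers can vary between calls
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content.strip()

def frequency_confidence(question: str, gold: str, n_samples: int = 100):
    answers = [sample_answer(question) for _ in range(n_samples)]
    top_answer, count = Counter(answers).most_common(1)[0]
    confidence = count / n_samples                 # how often the model repeats itself
    correct = gold.lower() in top_answer.lower()   # crude grader; the paper uses a stricter check
    return confidence, correct

# Aggregated over many questions, the study found accuracy rising with this
# repetition frequency but staying below it, i.e. continued overconfidence.
```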

Dual-panel calibration graph: all models show significant overconfidence, with actual accuracy consistently falling below stated confidence levels (left, 15 intervals); even answers that are repeated frequently fall short of the accuracy that frequency implies (right, 30 intervals). The dotted line marks perfect calibration. | Image: Wei et al.

The researchers note significant gaps in current AI systems’ factual accuracy that need addressing. They also point out an open research question: whether an AI’s performance on short factual answers predicts how well it handles longer, more detailed responses containing multiple facts.

OpenAI has released its SimpleQA benchmark on GitHub to help researchers develop more reliable language models.
