GPT-4o and Co. get it wrong more often than right, says OpenAI study
A new OpenAI study using their in-house SimpleQA benchmark shows that even the most advanced AI language models fail more often than they succeed when answering factual questions.
The SimpleQA test contains 4,326 questions across science, politics, and art, with each question designed to have one clear correct answer. Two independent reviewers verified answer accuracy.
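To make the setup concrete, here is a minimal sketch of how a SimpleQA-style benchmark could be scored. The `questions` list, the `ask_model` callable, and the exact-string grading are illustrative placeholders, not OpenAI's actual harness; the real benchmark sorts each answer into correct, incorrect, or not attempted using a more careful grading procedure.

```python
# A minimal sketch of scoring a SimpleQA-style benchmark. "questions" and
# "ask_model" are illustrative placeholders; the real benchmark grades answers
# with a more careful procedure than exact string matching.

def grade(prediction: str, gold: str) -> str:
    """Classify a model answer as correct, incorrect, or not attempted."""
    if not prediction.strip():
        return "not_attempted"
    if prediction.strip().lower() == gold.strip().lower():
        return "correct"
    return "incorrect"

def evaluate(questions, ask_model):
    """questions: iterable of {"question": str, "answer": str} dicts."""
    counts = {"correct": 0, "incorrect": 0, "not_attempted": 0}
    for item in questions:
        counts[grade(ask_model(item["question"]), item["answer"])] += 1
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}
```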
OpenAI’s best model, o1-preview, achieved only a 42.7 percent success rate. GPT-4o followed with 38.2 percent correct answers, while the smaller GPT-4o-mini managed just 8.6 percent accuracy.
Anthropic’s Claude models performed even worse. Their top model, Claude-3.5-sonnet, got 28.9 percent of questions right and 36.1 percent wrong, leaving the rest unanswered. However, smaller Claude models more often declined to answer when uncertain – a desirable response that shows they recognize the limits of their knowledge.
Context matters
Note that the test specifically measures knowledge acquired during training. It does not assess the models’ general ability to provide correct answers when given additional context, Internet access, or database connections.
The key takeaway: Users should think of AI models as information processors, not as stand-alone sources of knowledge. For best results, provide them with reliable data rather than relying solely on their built-in knowledge.
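As a rough illustration of that advice, the sketch below passes a trusted document into the prompt instead of relying on the model's memorized knowledge, using the OpenAI Python SDK. The file name, question, and system instruction are hypothetical; the point is simply the pattern of supplying reliable context alongside the query.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical reference document; the idea is to ground the model in text
# you trust rather than in whatever it memorized during training.
source_text = open("trusted_source.txt", encoding="utf-8").read()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "Answer using only the provided source. "
                    "If the source does not contain the answer, say you don't know."},
        {"role": "user",
         "content": f"Source:\n{source_text}\n\nQuestion: What does the source say about X?"},
    ],
)
print(response.choices[0].message.content)
```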
But OpenAI’s findings raise concerns about current patterns of AI use. Many people, especially students, use these systems as standalone research and learning tools, trusting them to give accurate answers most of the time – a practice these results suggest is problematic. The data shows that AI models simply aren’t reliable enough for independent fact-finding or verification.
AI models overestimate themselves
The study also shows that AI language models significantly overestimate their own capabilities when answering questions. When researchers asked the models to rate their confidence in their answers, the AIs consistently gave inflated scores for their own accuracy.
To measure this overconfidence systematically, researchers had the models answer the same questions 100 times each. They found that when a model gave the same answer repeatedly, it was more likely to be correct – but even then, actual success rates remained lower than what the models predicted about their own performance. This finding fits with the common criticism that language models can confidently present complete nonsense as if it were correct.
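A rough sketch of that repeated-sampling idea, with `ask_model` as a hypothetical single-query callable and exact matching standing in for real grading:

```python
from collections import Counter

# Ask the same question n times and record how often the model repeats its
# most common answer, plus whether that answer is actually right. Aggregated
# over many questions, the gap between answer frequency (and stated
# confidence) and real accuracy exposes overconfidence.

def answer_consistency(question: str, gold: str, ask_model, n: int = 100):
    answers = [ask_model(question) for _ in range(n)]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return {
        "frequency": top_count / n,  # how often the model repeats its favorite answer
        "top_answer_correct": top_answer.strip().lower() == gold.strip().lower(),
    }
```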
The researchers note significant gaps in current AI systems’ factual accuracy that need addressing. They also point out an open research question: whether an AI’s performance on short factual answers predicts how well it handles longer, more detailed responses containing multiple facts.
OpenAI has released its SimpleQA benchmark on GitHub to help researchers develop more reliable language models.