Apple AI boffins pour cold water on reasoning models

If you are betting on AGI – artificial general intelligence, the point at which AI models rival human cognition – showing up next year, you may want to adjust your timeline.
Apple AI researchers have found that the “thinking” ability of so-called “large reasoning models” collapses when things get complicated. Their findings, described in a paper titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity,” indicate that the intellectual potential of such models is so far quite limited.
Large reasoning models (LRMs), such as OpenAI’s o1/o3, DeepSeek-R1, Claude 3.7 Sonnet Thinking, and Gemini Thinking, are designed to break problems down into smaller steps. Instead of responding to a prompt with a specific prediction, they use mechanisms like Chain of Thought to iterate through a series of steps, validating their intermediate answers along the way, to arrive at a solution to the stated problem.
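For the curious, here's a rough sketch of the difference in Python. It is illustrative only: generate() is a hypothetical stand-in for a call to whichever model API you fancy, and the canned stub exists purely so the snippet runs as written.

# A minimal sketch of the difference between asking a standard model for an
# answer in one shot and running a chain-of-thought style loop that checks
# intermediate steps. generate() is a hypothetical placeholder, not a real
# library call; the stub returns canned text so the sketch executes as-is.

def generate(prompt: str) -> str:
    # Replace this stub with a real model call of your choosing.
    return "yes" if prompt.rstrip().endswith("(yes/no)") else "(model output)"

def direct_answer(question: str) -> str:
    # A plain LLM predicts the answer in a single pass.
    return generate(f"Question: {question}\nAnswer:")

def reasoned_answer(question: str, max_steps: int = 5) -> str:
    # A reasoning model works through intermediate steps, validating each
    # one before committing to a final answer.
    transcript = f"Question: {question}\nLet's think step by step.\n"
    for step in range(1, max_steps + 1):
        thought = generate(transcript + f"Step {step}:")
        transcript += f"Step {step}: {thought}\n"
        verdict = generate(transcript + "Does this step still look consistent? (yes/no)")
        if verdict.strip().lower().startswith("no"):
            transcript += "That step looks shaky; backtracking.\n"
    return generate(transcript + "Final answer:")

print(direct_answer("How many r's are in the word strawberry?"))
print(reasoned_answer("How many r's are in the word strawberry?"))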
Authors Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar set out to test how these reasoning models perform. Rather than relying on standard benchmark tests, they designed a puzzle environment for the models.
The puzzle regime gave the researchers control over the complexity of the challenges while avoiding benchmark data contamination, a problem that arises when language models inadvertently absorb evaluation benchmarks during training, skewing their performance in testing. Some model makers have also been accused of gaming benchmarks, which just aren’t all that great to begin with.
The puzzle environment included various games like the Tower of Hanoi, in which the goal is to move a stack of differently sized disks from one of three upright pegs to another, one disk at a time, without ever placing a larger disk on top of a smaller one.
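The Tower of Hanoi happens to make a tidy difficulty dial: the shortest solution for n disks takes exactly 2^n - 1 moves, so every extra disk roughly doubles the length of the plan a model must produce without error. A quick Python sketch of the classic recursive solution (ours, for illustration, not the researchers' test harness) shows the blow-up:

# Minimal Tower of Hanoi solver, included to show why the puzzle gives
# fine-grained control over difficulty: the optimal plan for n disks is
# always exactly 2**n - 1 moves long.

def hanoi(n: int, source: str = "A", spare: str = "B", target: str = "C") -> list[tuple[str, str]]:
    """Return the optimal move sequence for n disks as (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    return (
        hanoi(n - 1, source, target, spare)    # park the n-1 smaller disks on the spare peg
        + [(source, target)]                   # move the largest disk to the target peg
        + hanoi(n - 1, spare, source, target)  # stack the smaller disks back on top
    )

for disks in range(1, 11):
    moves = hanoi(disks)
    assert len(moves) == 2**disks - 1
    print(f"{disks} disks -> {len(moves)} moves")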
The researchers found that reasoning models handled moderately complex problems better than standard models, but broke down entirely once complexity passed a certain threshold.
“[D]espite their sophisticated self-reflection mechanisms learned through reinforcement learning, these models fail to develop generalizable problem-solving capabilities for planning tasks, with performance collapsing to zero beyond a certain complexity threshold,” the paper says.
Reasoning models also underperformed standard large language models on easier problems: they often found the correct solution early but kept exploring, burning compute on unnecessary extra steps.
The authors argue that the results suggest large reasoning models may not provide a path toward better artificial thinking.
“These insights challenge prevailing assumptions about LRM capabilities and suggest that current approaches may be encountering fundamental barriers to generalizable reasoning,” the authors conclude. ®