AI study reveals key factors behind LLMs' long chain-of-thought reasoning abilities

A systematic investigation shows how reasoning models come to generate long chains of thought. The results provide practical tips for optimizing training strategies.
The team at IN.AI, along with researchers from Tsinghua University and Carnegie Mellon University, has mapped out how AI models develop their ability to work through long chains of thought. Their systematic study used supervised fine-tuning (SFT) and reinforcement learning (RL) to identify the key factors behind this capability.
The research yielded four key insights. First, while SFT makes training more efficient and straightforward, it isn't essential, which supports what DeepSeek found with its R1-Zero model. The team tested this using Llama-3.1-8B and Qwen2.5-7B math models, training them with both long and short reasoning chains. They found that SFT with longer chains of thought not only performed better, but also made subsequent RL improvements more effective.
Second, while more computing power during RL training tends to improve reasoning abilities, it’s not guaranteed. The length of reasoning chains doesn’t always grow steadily during RL training, making the right reward design crucial for consistent improvement.
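To make the point about reward design concrete, here is a minimal sketch of a length-aware reward in Python. It is an illustration only, not the reward used in the study: the function name `length_aware_reward`, the token budget, and all numeric weights are assumptions chosen for readability.

```python
def length_aware_reward(is_correct: bool, num_tokens: int, max_tokens: int = 4096) -> float:
    """Hypothetical reward that depends on correctness and response length.

    All constants are illustrative assumptions, not values from the study.
    """
    # A response that hits the token budget is treated as truncated and penalized.
    if num_tokens >= max_tokens:
        return -1.0

    used = num_tokens / max_tokens  # fraction of the budget consumed, in [0, 1)

    if is_correct:
        # Correct answers score highest when they are short.
        return 1.0 - 0.5 * used
    # Wrong answers are penalized slightly less when the model reasoned longer,
    # which nudges it to keep thinking instead of giving up early.
    return -0.5 + 0.25 * used


print(length_aware_reward(True, 2048))   # correct, half the budget used
print(length_aware_reward(False, 2048))  # wrong, half the budget used
print(length_aware_reward(True, 4096))   # budget exhausted
```

The idea behind a shape like this is that the reward, rather than raw compute alone, decides whether longer reasoning is encouraged or discouraged during RL training.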
Third, getting reliable reward signals at scale is key to successful RL training. The team explored using web-scraped data with imperfect solutions to scale up these signals. Testing with the WebInstruct dataset, they compared different verification methods and found that rule-based verification worked best when filtering for shorter responses. Models trained on diverse data, even if somewhat noisy, handled unusual cases better than models trained only on carefully verified data.
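As a rough illustration of what rule-based verification can look like, the Python sketch below extracts a final answer and compares it with a reference by string matching. The `\boxed{...}` answer convention, the helper names, and the 20-character cutoff are assumptions made for this example; the study's actual verifier and filtering rule may differ.

```python
import re
from typing import Optional


def extract_final_answer(response: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in the response, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None


def normalize(answer: str) -> str:
    """Strip spaces, dollar signs, and a trailing period; lowercase the rest."""
    return answer.replace(" ", "").replace("$", "").rstrip(".").lower()


def rule_based_reward(response: str, reference: str) -> float:
    """Return 1.0 if the extracted answer matches the reference, else 0.0."""
    predicted = extract_final_answer(response)
    if predicted is None:
        return 0.0
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0


def is_short_enough(reference: str, max_chars: int = 20) -> bool:
    """Hypothetical filter: keep only examples with short reference answers,
    since long free-form answers are hard to check by string matching."""
    return len(reference.strip()) <= max_chars
```

A filter like `is_short_enough` would be applied to the dataset before training, so that the string-matching reward is only used on examples where it is likely to be reliable.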
Fourth, while base models already contain core capabilities like error correction, using RL to apply these skills to complex tasks can require significant computing resources.
Larger models still seem to be important
The research suggests that some behaviors, like double-checking solutions, might be learned during pre-training, possibly from human discussions in online forums. RL seems to mainly help models recombine skills they already picked up during pre-training.
The team believes that model size remains the main constraint: smaller models appear limited in how sophisticated their reasoning abilities can become. They're considering testing RL with larger base models in the future, though the necessary open-source infrastructure for such experiments is still developing.