Shanghai Jiao Tong Researchers Propose OctoThinker for Reinforcement Learning-Scalable LLM Development

Introduction: Reinforcement Learning Progress through Chain-of-Thought Prompting
LLMs have shown excellent progress in complex reasoning tasks through chain-of-thought (CoT) prompting combined with large-scale reinforcement learning (RL). Models like DeepSeek-R1-Zero have shown strong reasoning capabilities by applying RL directly to base models. Similarly, methods such as SimpleRL and Open-Reasoner-Zero show improvements in smaller models like the Qwen series. However, achieving success across different base model families remains a challenge. Moreover, applying R1-Zero-style training to base models such as the Llama series proves difficult, raising a fundamental question about the underlying factors that lead different base models to behave inconsistently during reinforcement learning.
Limitations of RL Scaling on Llama Models
Large-scale RL has driven advances in models like OpenAI’s o1 and o3 and DeepSeek’s R1 on competition-level mathematics problems, motivating the exploration of RL on smaller models with fewer than 100B parameters. However, most of these efforts are limited to the Qwen model family, and replicating the results on families such as Llama has proven difficult. The lack of transparency in pre-training pipelines has made it hard to understand how pre-training influences RL scaling. This has prompted unconventional studies, which found that one-shot prompting improves reasoning in Qwen but offers little benefit in Llama. Efforts to curate high-quality mathematical pre-training corpora through projects like OpenWebMath, MathPile, InfiMM-Web-Math, and FineMath have made progress but remain limited in scale, at under 100B tokens.

Exploring Mid-Training with Stable-then-Decay Strategy
Researchers from Shanghai Jiao Tong University investigate how mid-training strategies shape RL dynamics, focusing on Qwen and Llama. The study presents several insights: First, high-quality mathematical corpora such as MegaMath-Web-Pro boost both base model and RL outcomes. Second, using QA-style data, especially data with long CoT reasoning, further enhances RL results. Third, long CoT also introduces verbosity and instability into RL training. Lastly, scaling up the mid-training token budget results in stronger downstream RL performance. The researchers introduce a two-stage mid-training strategy called Stable-then-Decay, in which base models are first trained on 200B tokens, followed by 20B tokens split across three CoT-focused branches, resulting in OctoThinker models that show strong RL compatibility.
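To make the two-stage schedule concrete, here is a minimal sketch of a Stable-then-Decay setup. The 200B/20B token split and the three CoT-focused branches come from the description above; the constant-then-linear-decay learning-rate shape, the peak learning rate, and the branch data mixes are illustrative assumptions, not the authors' exact recipe.

```python
# Sketch of a "Stable-then-Decay" mid-training schedule (assumptions noted above).

STABLE_TOKENS = 200e9   # stage 1: broad math corpus, constant learning rate (assumed)
DECAY_TOKENS = 20e9     # stage 2: CoT-focused branch mixes, decaying learning rate
PEAK_LR = 3e-5          # hypothetical peak learning rate

# Three CoT-focused decay branches -> OctoThinker variants; mixture weights are illustrative.
DECAY_BRANCHES = {
    "short":  {"short_cot_qa": 0.6, "math_web": 0.4},
    "long":   {"long_cot_qa": 0.6, "math_web": 0.4},
    "hybrid": {"short_cot_qa": 0.3, "long_cot_qa": 0.3, "math_web": 0.4},
}

def lr_at(tokens_seen: float) -> float:
    """Constant LR during the stable stage, then linear decay to zero over the decay stage."""
    if tokens_seen <= STABLE_TOKENS:
        return PEAK_LR
    progress = min((tokens_seen - STABLE_TOKENS) / DECAY_TOKENS, 1.0)
    return PEAK_LR * (1.0 - progress)

if __name__ == "__main__":
    for t in (50e9, 200e9, 210e9, 220e9):
        print(f"{t / 1e9:.0f}B tokens -> lr {lr_at(t):.2e}")
```

The key design choice this illustrates is that the stable stage builds general mathematical competence at scale, while the short decay stage specializes each branch on a different CoT data mix before RL begins.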
RL Configuration and Benchmark Evaluation
Researchers use the MATH8K dataset for RL training prompts. The configuration includes a global training batch size of 128, 16 rollout responses per query, and a PPO mini-batch size of 64, with experiments conducted on Llama-3.2-3B-Base and Qwen2.5-3B-Base models. For evaluation, few-shot prompting is used for base language models, and zero-shot for RL-tuned models across indicator tasks, including GSM8K, MATH500, OlympiadBench, and AMC23. During RL training, Qwen models exhibit increasing response lengths that remain reasonable throughout, whereas Llama displays abnormal behavior, with average response lengths escalating to 4,096 tokens. Evaluation further reveals that RL-tuned Qwen2.5-3B achieves improvements across benchmarks, while Llama-3.2-3B shows only marginal gains.
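The reported RL setup can be summarized as a plain configuration object, sketched below. The numeric values (batch size 128, 16 rollouts per query, PPO mini-batch size 64), the model names, and the benchmark list come from the text; the field names, the 4,096-token cap, and the toy reward stub are illustrative assumptions rather than the authors' actual training code.

```python
# Minimal sketch of the RL configuration described above (values from the text,
# structure and reward stub are assumptions).
from dataclasses import dataclass

@dataclass
class RLConfig:
    prompt_dataset: str = "MATH8K"                      # source of RL training prompts
    policy_models: tuple = ("Llama-3.2-3B-Base", "Qwen2.5-3B-Base")
    global_batch_size: int = 128                        # prompts per RL training step
    rollouts_per_query: int = 16                        # sampled responses per prompt
    ppo_mini_batch_size: int = 64
    max_response_tokens: int = 4096                     # assumed cap; Llama's lengths escalate toward it
    eval_benchmarks: tuple = ("GSM8K", "MATH500", "OlympiadBench", "AMC23")

def reward(response: str, reference_answer: str) -> float:
    """Toy verifiable reward: 1.0 if the reference answer appears in the response, else 0.0."""
    return 1.0 if reference_answer.strip() in response else 0.0

if __name__ == "__main__":
    print(RLConfig())
```

Base models are evaluated with few-shot prompting and RL-tuned models with zero-shot prompting, so the benchmark tuple above would be queried under different prompting regimes depending on the checkpoint.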
OctoThinker Outperforms Llama in RL Compatibility
Each OctoThinker branch demonstrates a 10%-20% improvement over the original Llama base model and consistent gains over the stable-stage model across all sizes when evaluated on 13 mathematical benchmarks. The OctoThinker-Zero families reveal diverse thinking behaviors during RL scaling, with strong performance from the OctoThinker-Long variant. When comparing three 3B-scale base models during RL training, OctoThinker-Long-3B outperforms the original Llama-3.2-3B model and reaches performance parity with Qwen2.5-3B, a model known for strong reasoning capabilities and extensive pre-training. The hybrid and short branches show slightly lower performance, especially on challenging benchmarks.
Conclusion and Future Work: Toward RL-Ready Foundation Models
This paper investigates why base models such as Llama and Qwen exhibit divergent behaviors during RL for reasoning, showing that mid-training plays a major role in RL scalability. The two-stage mid-training strategy transforms Llama into a foundation model better suited for RL, resulting in OctoThinker models. Future research directions include:
- Curating higher-quality mathematical corpora to improve mid-training.
- Creating RL-friendly base models using open recipes without distillation from long CoT reasoning models.
- Separating the QA format and content to understand their contributions individually.
- Expanding the OctoThinker family with new branches, such as tool-integrated reasoning.
Check out the Paper, Hugging Face Page and GitHub Page. All credit for this research goes to the researchers of this project.
Sajjad Ansari is a final year undergraduate from IIT Kharagpur. As a Tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.