

Parallel-R1: Towards Parallel Thinking via Reinforcement Learning

September 9, 2025
作者: Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Xinyu Yang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, Dong Yu
cs.AI

Abstract

Parallel thinking has emerged as a novel approach for enhancing the reasoning capabilities of large language models (LLMs) by exploring multiple reasoning paths concurrently. However, activating such capabilities through training remains challenging, as existing methods predominantly rely on supervised fine-tuning (SFT) over synthetic data, which encourages teacher-forced imitation rather than exploration and generalization. In contrast, we propose Parallel-R1, the first reinforcement learning (RL) framework that enables parallel thinking behaviors for complex real-world reasoning tasks. Our framework employs a progressive curriculum that explicitly addresses the cold-start problem in training parallel thinking with RL. We first use SFT on prompt-generated trajectories from easier tasks to instill the parallel thinking ability, then transition to RL to explore and generalize this skill on harder problems. Experiments on various math benchmarks, including MATH, AMC23, and AIME, show that Parallel-R1 successfully instills parallel thinking, yielding an 8.4% accuracy improvement over the sequential thinking model trained directly on challenging tasks with RL. Further analysis reveals a clear shift in the model's thinking behavior: at an early stage, it uses parallel thinking as an exploration strategy, while in a later stage, it uses the same capability for multi-perspective verification. Most significantly, we validate parallel thinking as a mid-training exploration scaffold, where this temporary exploratory phase unlocks a higher performance ceiling after RL, yielding a 42.9% improvement over the baseline on AIME25. Our model, data, and code will be open-sourced at https://github.com/zhengkid/Parallel-R1.
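As a rough illustration of the progressive curriculum the abstract describes, the sketch below outlines the two stages: an SFT cold-start phase on easy-task trajectories that contain explicit parallel-thinking blocks, followed by an RL phase with an outcome reward on harder problems. This is a minimal sketch under stated assumptions, not the authors' released code; all function names, callables, and the `<Parallel> ... </Parallel>` tag format are illustrative placeholders.

```python
"""Minimal sketch (an assumption, not the authors' implementation) of the
Parallel-R1 progressive curriculum: SFT cold-start on easy-task trajectories
that demonstrate parallel thinking, then RL exploration on harder problems."""

from typing import Callable, List, Tuple


def progressive_curriculum(
    sft_step: Callable[[str, str], None],                     # one supervised update on (prompt, target)
    rl_step: Callable[[str, List[str], List[float]], None],   # one policy update on a group of rollouts
    generate: Callable[[str], str],                            # sample a reasoning trace from the policy
    easy_trajectories: List[Tuple[str, str]],                  # prompts paired with parallel-thinking targets
    hard_problems: List[Tuple[str, str]],                      # (problem, gold answer) pairs
    check: Callable[[str, str], bool],                         # does the trace reach the gold answer?
    sft_epochs: int = 2,
    rl_passes: int = 4,
    group_size: int = 8,
) -> None:
    # Stage 1: cold start. Targets are assumed to contain explicit parallel
    # blocks (e.g. "<Parallel> path A ... path B ... </Parallel>"), so
    # imitation instills the parallel-thinking format on easier tasks.
    for _ in range(sft_epochs):
        for prompt, target in easy_trajectories:
            sft_step(prompt, target)

    # Stage 2: RL on harder problems. A simple outcome reward (correct = 1.0)
    # over a group of rollouts lets the model explore when branching into
    # parallel reasoning paths actually pays off, and generalize the skill.
    for _ in range(rl_passes):
        for problem, answer in hard_problems:
            rollouts = [generate(problem) for _ in range(group_size)]
            rewards = [1.0 if check(trace, answer) else 0.0 for trace in rollouts]
            rl_step(problem, rollouts, rewards)
```

The two-stage split mirrors the paper's framing: imitation handles the cold-start problem of getting parallel-thinking structure into the policy at all, while the RL stage decides when that structure is worth using.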