Parallel-R1: Towards Parallel Thinking via Reinforcement Learning
September 9, 2025
Authors: Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Xinyu Yang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, Dong Yu
cs.AI
Abstract
Parallel thinking has emerged as a novel approach for enhancing the reasoning
capabilities of large language models (LLMs) by exploring multiple reasoning
paths concurrently. However, activating such capabilities through training
remains challenging, as existing methods predominantly rely on supervised
fine-tuning (SFT) over synthetic data, which encourages teacher-forced
imitation rather than exploration and generalization. In contrast, we
propose Parallel-R1, the first reinforcement learning (RL) framework
that enables parallel thinking behaviors for complex real-world reasoning
tasks. Our framework employs a progressive curriculum that explicitly addresses
the cold-start problem in training parallel thinking with RL. We first use SFT
on prompt-generated trajectories from easier tasks to instill the parallel
thinking ability, then transition to RL to explore and generalize this skill on
harder problems. Experiments on various math benchmarks, including MATH, AMC23,
and AIME, show that Parallel-R1 successfully instills parallel thinking,
leading to an 8.4% accuracy improvement over the sequential-thinking model
trained directly on challenging tasks with RL. Further analysis reveals a clear
shift in the model's thinking behavior: at an early stage, it uses parallel
thinking as an exploration strategy, while in a later stage, it uses the same
capability for multi-perspective verification. Most significantly, we validate
parallel thinking as a mid-training exploration scaffold, where this
temporary exploratory phase unlocks a higher performance ceiling after RL,
yielding a 42.9% improvement over the baseline on AIME25. Our model, data, and
code will be open-sourced at https://github.com/zhengkid/Parallel-R1.
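
The progressive curriculum described in the abstract amounts to a two-stage training loop: a short SFT cold start on prompt-generated parallel-thinking trajectories from easier problems, followed by RL on harder problems where the model explores and generalizes the skill. The sketch below is a minimal illustration of that structure only; ToyPolicy, verify, and the dataset variables are hypothetical placeholders, not the authors' implementation.

```python
import random

class ToyPolicy:
    """Hypothetical stand-in for an LLM policy; real parameter updates are omitted."""
    def sft_update(self, trajectory):
        # Supervised step on one parallel-thinking trajectory (placeholder).
        pass

    def generate(self, problem, n=8):
        # Sample n candidate solutions for a problem (placeholder).
        return [f"candidate_{i}_for_{problem}" for i in range(n)]

    def rl_update(self, rollouts, rewards):
        # Policy-gradient-style update from verified rollouts (placeholder).
        pass

def verify(rollout, problem):
    # Hypothetical outcome reward: 1.0 if the final answer checks out, else 0.0.
    return random.choice([0.0, 1.0])

def train(policy, easy_trajectories, hard_problems, sft_epochs=3, rl_steps=1000):
    # Stage 1: SFT cold start on easier tasks to instill the parallel-thinking format.
    for _ in range(sft_epochs):
        for traj in easy_trajectories:
            policy.sft_update(traj)

    # Stage 2: RL on harder problems, where the model explores when and how
    # to branch into parallel reasoning paths and is rewarded on outcomes.
    for _ in range(rl_steps):
        problem = random.choice(hard_problems)
        rollouts = policy.generate(problem)
        rewards = [verify(r, problem) for r in rollouts]
        policy.rl_update(rollouts, rewards)

    return policy

# Toy usage with placeholder data.
policy = train(ToyPolicy(),
               easy_trajectories=["easy trajectory with parallel paths"],
               hard_problems=["hard competition problem"])
```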