Parallel-R1: Towards Parallel Thinking via Reinforcement Learning

Abstract

Parallel thinking has emerged as a novel approach for enhancing the reasoning capabilities of large language models (LLMs) by exploring multiple reasoning paths concurrently. However, activating this capability through training remains challenging: existing methods predominantly rely on supervised fine-tuning (SFT) over synthetic data, which encourages teacher-forced imitation rather than exploration and generalization. In contrast, we propose Parallel-R1, the first reinforcement learning (RL) framework that enables parallel thinking behaviors for complex real-world reasoning tasks. Our framework employs a progressive curriculum that explicitly addresses the cold-start problem in training parallel thinking with RL. We first apply SFT to prompt-generated trajectories from easier tasks to instill the parallel thinking ability, then transition to RL to explore and generalize this skill on harder problems. Experiments on math benchmarks including MATH, AMC23, and AIME show that Parallel-R1 successfully instills parallel thinking, yielding an 8.4% accuracy improvement over a sequential thinking model trained directly on challenging tasks with RL. Further analysis reveals a clear shift in the model's thinking behavior: at an early stage it uses parallel thinking as an exploration strategy, while at a later stage it uses the same capability for multi-perspective verification. Most significantly, we validate parallel thinking as a mid-training exploration scaffold: this temporary exploratory phase unlocks a higher performance ceiling after RL, yielding a 42.9% improvement over the baseline on AIME25. Our model, data, and code will be open-sourced at https://github.com/zhengkid/Parallel-R1.
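As a rough illustration of the progressive curriculum described in the abstract, the ordering of the two stages (SFT on easier tasks first, RL on harder tasks second) might be sketched as follows. This is only a sketch under stated assumptions, not the authors' implementation: the `Task`, `Stage`, and `build_curriculum` names, the numeric `difficulty` score, and the `threshold` cut-off are all hypothetical and do not come from the paper, and the actual SFT and RL training steps are omitted.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Task:
    prompt: str
    difficulty: float  # hypothetical score in [0, 1]; the abstract defines no such metric


@dataclass(frozen=True)
class Stage:
    method: str  # "sft" (cold start) or "rl" (exploration and generalization)
    tasks: Tuple[Task, ...]


def build_curriculum(tasks: Sequence[Task], threshold: float = 0.5) -> List[Stage]:
    """Order training as SFT on easier tasks, then RL on harder ones.

    `threshold` is an illustrative cut-off, not a value from the paper.
    """
    easy = tuple(t for t in tasks if t.difficulty < threshold)
    hard = tuple(t for t in tasks if t.difficulty >= threshold)
    return [Stage("sft", easy), Stage("rl", hard)]


if __name__ == "__main__":
    pool = [
        Task("add two fractions", 0.2),
        Task("competition geometry problem", 0.9),
        Task("solve a linear equation", 0.3),
    ]
    # Stages are listed in training order: the SFT stage always precedes the RL stage.
    for stage in build_curriculum(pool):
        print(stage.method, [t.prompt for t in stage.tasks])
```

The sketch captures only the scheduling idea: the SFT stage sees the easier tasks so the model can acquire the parallel thinking format, and the RL stage is reserved for the harder tasks where exploration is expected to matter.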
