Parallel-R1: Towards Parallel Thinking via Reinforcement Learning

Abstract

Parallel thinking has emerged as a novel approach for enhancing the reasoning capabilities of large language models (LLMs) by exploring multiple reasoning paths concurrently. However, activating this capability through training remains challenging: existing methods rely predominantly on supervised fine-tuning (SFT) over synthetic data, which encourages teacher-forced imitation rather than exploration and generalization. In contrast, we propose Parallel-R1, the first reinforcement learning (RL) framework that enables parallel thinking behaviors for complex real-world reasoning tasks. Our framework employs a progressive curriculum that explicitly addresses the cold-start problem in training parallel thinking with RL. We first apply SFT to prompt-generated trajectories from easier tasks to instill the parallel thinking ability, then transition to RL to explore and generalize this skill on harder problems. Experiments on various math benchmarks, including MATH, AMC23, and AIME, show that Parallel-R1 successfully instills parallel thinking, yielding an 8.4% accuracy improvement over the sequential thinking model trained directly on challenging tasks with RL. Further analysis reveals a clear shift in the model's thinking behavior: at an early stage, it uses parallel thinking as an exploration strategy, while at a later stage, it uses the same capability for multi-perspective verification. Most significantly, we validate parallel thinking as a mid-training exploration scaffold: this temporary exploratory phase unlocks a higher performance ceiling after RL, yielding a 42.9% improvement over the baseline on AIME25. Our model, data, and code will be open-sourced at https://github.com/zhengkid/Parallel-R1.
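The progressive curriculum described in the abstract (SFT on easier tasks first, then RL on harder problems) can be pictured as a simple two-stage schedule. The following is only a minimal illustrative sketch, not the authors' implementation: the names `Stage`, `build_curriculum`, and `stage_at`, as well as the step counts, are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    """One phase of a (hypothetical) training curriculum."""
    name: str        # training method: "sft" or "rl"
    difficulty: str  # task difficulty used in this phase: "easy" or "hard"
    steps: int       # number of training steps (illustrative value only)


def build_curriculum(sft_steps: int, rl_steps: int) -> List[Stage]:
    """Cold-start with SFT on easy tasks, then switch to RL on hard tasks."""
    return [
        Stage(name="sft", difficulty="easy", steps=sft_steps),
        Stage(name="rl", difficulty="hard", steps=rl_steps),
    ]


def stage_at(curriculum: List[Stage], step: int) -> Stage:
    """Return the stage that is active at a 0-indexed global training step."""
    if step < 0:
        raise ValueError("step must be non-negative")
    start = 0
    for stage in curriculum:
        if step < start + stage.steps:
            return stage
        start += stage.steps
    raise ValueError("step is beyond the end of the curriculum")


# Assumed step counts; the abstract does not specify them.
curriculum = build_curriculum(sft_steps=100, rl_steps=400)
```

With these assumed step counts, `stage_at(curriculum, 0)` falls in the SFT phase on easy tasks and `stage_at(curriculum, 100)` falls in the RL phase on hard tasks. The point of the ordering is that RL starts from a model that has already been shown the parallel thinking format, rather than from a cold start.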
