RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization

Abstract

Reinforcement Learning with Verifiable Reward (RLVR) has significantly advanced the complex reasoning abilities of Large Language Models (LLMs). However, it struggles to break through the inherent capability boundaries of the base LLM, because its essentially on-policy strategy is coupled with the immense action space of LLMs and sparse rewards. Critically, RLVR can lead to capability boundary collapse, narrowing the LLM's problem-solving scope. To address this problem, we propose RL-PLUS, a novel hybrid-policy optimization approach for LLMs that synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. RL-PLUS integrates two core components: Multiple Importance Sampling, which addresses the distributional mismatch introduced by external data, and an Exploration-Based Advantage Function, which guides the model towards high-value, unexplored reasoning paths. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach. Compared with existing RLVR methods, RL-PLUS achieves 1) state-of-the-art performance on six math reasoning benchmarks; 2) superior performance on six out-of-distribution reasoning tasks; and 3) consistent and significant gains across diverse model families, with average relative improvements of up to 69.2%. Moreover, analysis of Pass@k curves indicates that RL-PLUS effectively resolves the capability boundary collapse problem.
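As a point of reference for the Pass@k analysis mentioned in the abstract, the sketch below computes the widely used unbiased Pass@k estimator, 1 - C(n-c, k) / C(n, k), for a problem with n sampled solutions of which c are correct. The abstract does not state which estimator the authors use, so this is an assumption, and the function name `pass_at_k` is purely illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Assumption: this is the standard unbiased estimator
    1 - C(n - c, k) / C(n, k); the abstract does not confirm
    that the paper computes Pass@k this way.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k incorrect samples, every size-k subset
    # contains at least one correct sample.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k_curve(n: int, c: int, ks: list[int]) -> list[float]:
    """Evaluate pass@k at several values of k for one problem."""
    return [pass_at_k(n, c, k) for k in ks]
```

Averaging `pass_at_k` over all problems in a benchmark for increasing k gives one Pass@k curve. A trained model whose curve falls below the base model's curve at large k is the kind of narrowing in problem-solving scope that the abstract calls capability boundary collapse.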
