Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a key paradigm for post-training Large Language Models (LLMs), particularly for complex reasoning tasks. However, vanilla RLVR training has been shown to improve Pass@1 performance at the expense of policy entropy, which reduces generation diversity and limits Pass@k performance, a metric that typically represents the upper bound of LLM reasoning capability. In this paper, we systematically analyze the policy's generation diversity from the perspective of training problems and find that augmenting and updating the training problems helps mitigate entropy collapse during training. Based on these observations, we propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training, which uses the policy's correct solutions to synthesize variational problems while ensuring that their reference answers remain identical to the originals. This self-improving strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR, sustaining prolonged improvements and achieving absolute gains of 18.3% and 22.8% in Pass@32 performance on the competition-level AIME24 and AIME25 benchmarks, respectively. Experiments on 12 reasoning benchmarks across model sizes from 3B to 32B consistently demonstrate the generalizability and robustness of SvS.
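
To make the Pass@1 versus Pass@k distinction concrete, the following minimal sketch computes Pass@k with the widely used unbiased estimator 1 - C(n-c, k)/C(n, k), where n is the number of sampled solutions and c the number that are correct. The abstract does not state which estimator the authors use, so the estimator choice, the function name `pass_at_k`, and the sample counts below are assumptions for illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n sampled solutions, c of which are correct.

    Uses the standard unbiased estimator 1 - C(n - c, k) / C(n, k).
    This is an illustrative assumption, not necessarily the paper's setup.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset has a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 32 samples, only 2 of them correct.
print(pass_at_k(n=32, c=2, k=1))   # chance a single sample is correct
print(pass_at_k(n=32, c=2, k=32))  # chance at least one of 32 is correct
```

In this hypothetical case a rarely solved problem contributes little to Pass@1 but counts fully toward Pass@32, which is why a loss of generation diversity can cap Pass@k even while Pass@1 rises.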
