Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR
August 19, 2025
Authors: Xiao Liang, Zhongzhi Li, Yeyun Gong, Yelong Shen, Ying Nian Wu, Zhijiang Guo, Weizhu Chen
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as
a key paradigm for post-training Large Language Models (LLMs), particularly for
complex reasoning tasks. However, vanilla RLVR training has been shown to
improve Pass@1 performance at the expense of policy entropy, leading to reduced
generation diversity and limiting the Pass@k performance, which typically
represents the upper bound of LLM reasoning capability. In this paper, we
systematically analyze the policy's generation diversity from the perspective
of training problems and find that augmenting and updating training problems
helps mitigate entropy collapse during training. Based on these observations,
we propose an online Self-play with Variational problem Synthesis (SvS)
strategy for RLVR training, which uses the policy's correct solutions to
synthesize variational problems while ensuring their reference answers remain
identical to the originals. This self-improving strategy effectively maintains
policy entropy during training and substantially improves Pass@k compared with
standard RLVR, sustaining prolonged improvements and achieving absolute gains
of 18.3% and 22.8% in Pass@32 performance on the competition-level AIME24 and
AIME25 benchmarks, respectively. Experiments on 12 reasoning benchmarks across varying model
sizes from 3B to 32B consistently demonstrate the generalizability and
robustness of SvS.
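
To make the two quantities discussed in the abstract concrete, the following is a minimal sketch of how Pass@k and policy entropy are commonly computed. It assumes the standard unbiased Pass@k estimator from the code-generation evaluation literature, 1 - C(n-c, k) / C(n, k) for n sampled solutions of which c are correct, and the Shannon entropy of a single next-token distribution. The abstract does not state which estimator the authors use, and the function names `pass_at_k` and `token_entropy` are hypothetical.

```python
from math import comb, log


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate for one problem (assumed estimator).

    n: number of sampled solutions, c: number of correct ones,
    k: sampling budget being evaluated (1 <= k <= n).
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset contains a correct one.
        return 1.0
    # Probability that a random k-subset contains no correct sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * log(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    # One correct sample out of four: Pass@1 is low, Pass@2 is higher.
    print(pass_at_k(4, 1, 1), pass_at_k(4, 1, 2))
    # A peaked distribution has lower entropy than a uniform one.
    print(token_entropy([0.97, 0.01, 0.01, 0.01]), token_entropy([0.25] * 4))
```

Under this estimator, a policy whose samples collapse onto a single answer can raise Pass@1 while leaving Pass@k flat for larger k, which is the trade-off the abstract attributes to entropy collapse.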