

Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR

August 19, 2025
Authors: Xiao Liang, Zhongzhi Li, Yeyun Gong, Yelong Shen, Ying Nian Wu, Zhijiang Guo, Weizhu Chen
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a key paradigm for post-training Large Language Models (LLMs), particularly for complex reasoning tasks. However, vanilla RLVR training has been shown to improve Pass@1 performance at the expense of policy entropy, leading to reduced generation diversity and limiting the Pass@k performance, which typically represents the upper bound of LLM reasoning capability. In this paper, we systematically analyze the policy's generation diversity from the perspective of training problems and find that augmenting and updating training problems helps mitigate entropy collapse during training. Based on these observations, we propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training, which uses the policy's correct solutions to synthesize variational problems while ensuring their reference answers remain identical to the originals. This self-improving strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR, sustaining prolonged improvements and achieving absolute gains of 18.3% and 22.8% in Pass@32 performance on the competition-level AIME24 and AIME25 benchmarks. Experiments on 12 reasoning benchmarks across varying model sizes from 3B to 32B consistently demonstrate the generalizability and robustness of SvS.
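The abstract describes the SvS loop only at a high level, so the sketch below illustrates what one iteration might look like. It is a minimal Python sketch under stated assumptions: the `policy` object with a `generate` method, the toy `is_correct` verifier, and the synthesis prompt are all illustrative placeholders, not the authors' implementation.

```python
# A minimal sketch of one SvS iteration. `policy` is assumed to expose a
# generate(prompt) -> str method; the verifier and prompt wording are
# illustrative assumptions, not the paper's actual code.

def is_correct(solution: str, reference_answer: str) -> bool:
    # Toy verifiable-reward check: accept if the solution ends with the
    # reference answer. Real RLVR verifiers are task-specific.
    return solution.strip().endswith(reference_answer)

def svs_step(policy, problems, k: int = 8):
    """Sample k solutions per problem, keep verified ones, and use them
    to synthesize variational problems that share the original answer."""
    synthesized = []
    for prob in problems:
        samples = [policy.generate(prob["question"]) for _ in range(k)]
        correct = [s for s in samples if is_correct(s, prob["answer"])]
        if correct:
            # Self-play: the same policy rewrites the problem from one of
            # its own correct solutions; the reference answer is kept
            # unchanged so the variant remains verifiable.
            variant = policy.generate(
                "Rewrite the following solution as a new problem whose "
                f"answer is still {prob['answer']}:\n{correct[0]}"
            )
            synthesized.append({"question": variant,
                                "answer": prob["answer"]})
    # Per the abstract, refreshing the training pool with such variants is
    # what helps maintain policy entropy and Pass@k during RLVR training.
    return problems + synthesized
```

In an actual training run, the augmented pool returned here would feed the next RLVR update, so the problem set keeps changing online rather than being fixed up front.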