

Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR

August 19, 2025
作者: Xiao Liang, Zhongzhi Li, Yeyun Gong, Yelong Shen, Ying Nian Wu, Zhijiang Guo, Weizhu Chen
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a key paradigm for post-training Large Language Models (LLMs), particularly for complex reasoning tasks. However, vanilla RLVR training has been shown to improve Pass@1 performance at the expense of policy entropy, leading to reduced generation diversity and limiting the Pass@k performance, which typically represents the upper bound of LLM reasoning capability. In this paper, we systematically analyze the policy's generation diversity from the perspective of training problems and find that augmenting and updating training problems helps mitigate entropy collapse during training. Based on these observations, we propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training, which uses the policy's correct solutions to synthesize variational problems while ensuring their reference answers remain identical to the originals. This self-improving strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR, sustaining prolonged improvements and achieving absolute gains of 18.3% and 22.8% in Pass@32 performance on the competition-level AIME24 and AIME25 benchmarks. Experiments on 12 reasoning benchmarks across varying model sizes from 3B to 32B consistently demonstrate the generalizability and robustness of SvS.
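The self-play loop described above can be sketched in a few lines: sample rollouts for each problem, keep the verifiably correct ones, and use each correct solution to seed a variational problem whose reference answer matches the original. This is an illustrative sketch only; the function names (`svs_step`, `sample_solutions`, `verify`) and the stubbed policy are assumptions for exposition, not the paper's released implementation, and a real system would call an LLM for both rollout sampling and problem synthesis.

```python
def verify(solution, reference_answer):
    # Verifiable reward: 1 if the solution's final answer matches the reference.
    return solution["answer"] == reference_answer

def sample_solutions(policy, problem, k=8):
    # Stand-in for sampling k rollouts from the LLM policy.
    return [policy(problem) for _ in range(k)]

def svs_step(policy, problem_pool):
    """One SvS-style iteration: synthesize variational problems from
    correct solutions and return the augmented training pool."""
    new_problems = []
    for problem in problem_pool:
        solutions = sample_solutions(policy, problem)
        correct = [s for s in solutions if verify(s, problem["answer"])]
        # Self-play synthesis: each correct solution seeds a variational
        # problem that keeps the original reference answer, so the new
        # problem remains verifiable without extra labeling.
        for sol in correct:
            variant = {"question": sol["steps"] + " (rephrased as a new problem)",
                       "answer": problem["answer"]}
            new_problems.append(variant)
    # Augmenting and updating the pool with verified variants is what the
    # paper credits with maintaining policy entropy during RLVR training.
    return problem_pool + new_problems
```

In a full RLVR pipeline this step would run online between policy-gradient updates, with the synthesized problems filtered by the same verifier before being added to the training set.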