Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models
August 14, 2025
Authors: Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, Guang Shi
cs.AI
Abstract
Reinforcement learning with verifiable rewards (RLVR), which typically adopts Pass@1 as the reward, faces challenges in balancing exploration and exploitation, causing policies to prefer conservative actions and converge to a local optimum. Identifying an appropriate reward metric is therefore crucial. Although prior work has used Pass@k in evaluation, its connection to the exploration ability of LLMs in RLVR has remained largely overlooked. To investigate this, we first use Pass@k as the reward to train the policy model (i.e., Pass@k Training) and observe an improvement in its exploration ability. Next, we derive an analytical solution for the advantage of Pass@k Training, leading to an efficient and effective training process. Building on this, our analysis reveals that exploration and exploitation are not inherently conflicting objectives; rather, they can mutually enhance each other. Moreover, Pass@k Training with the analytical derivation essentially amounts to directly designing the advantage function. Inspired by this, we take a preliminary step toward advantage design for RLVR, showing promising results and highlighting a potential future direction.
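
As a rough illustration of the core idea, the sketch below shows the standard unbiased Pass@k estimator (1 - C(n-c, k) / C(n, k) over n rollouts with c verified correct) together with one plausible way to turn Pass@k into a group-level reward for RLVR rollouts. This is a minimal sketch under stated assumptions: the function names, the fixed-size grouping scheme, and the example numbers are illustrative and are not taken from the paper's exact training procedure or analytical advantage derivation.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n rollouts, of which c
    are verified correct, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def group_pass_at_k_rewards(correct: list[bool], k: int) -> list[float]:
    """Hypothetical Pass@k-style reward assignment: partition the
    rollouts for one prompt into groups of size k and give every
    rollout in a group reward 1.0 if any member of the group passes
    the verifier, else 0.0."""
    rewards: list[float] = []
    for start in range(0, len(correct), k):
        group = correct[start:start + k]
        group_reward = 1.0 if any(group) else 0.0
        rewards.extend([group_reward] * len(group))
    return rewards


if __name__ == "__main__":
    # 8 rollouts for one prompt, 2 of them verified correct.
    correct = [False, True, False, False, False, True, False, False]
    print(pass_at_k(n=len(correct), c=sum(correct), k=4))  # ~0.786
    print(group_pass_at_k_rewards(correct, k=4))           # [1.0] * 8
```

Under this kind of group-level reward, a single correct rollout lifts its whole group, so rare but successful explorations are not washed out the way they are under a per-sample Pass@1 reward; the paper's analytical treatment of the resulting advantage is what makes such training efficient in practice.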