

The Best of N Worlds: Aligning Reinforcement Learning with Best-of-N Sampling via max@k Optimisation

October 27, 2025
作者: Farid Bagirov, Mikhail Arkhipov, Ksenia Sycheva, Evgeniy Glukhov, Egor Bogomolov
cs.AI

Abstract

The application of Reinforcement Learning with Verifiable Rewards (RLVR) to mathematical and coding domains has demonstrated significant improvements in the reasoning and problem-solving abilities of Large Language Models. Despite its success in single-generation problem solving, the reinforcement learning fine-tuning process may harm the model's exploration ability, as reflected in decreased diversity of generations and a resulting degradation of performance during Best-of-N sampling for large N values. In this work, we focus on optimizing the max@k metric, a continuous generalization of pass@k. We derive an unbiased on-policy gradient estimate for direct optimization of this metric. Furthermore, we extend our derivations to off-policy updates, a common element in modern RLVR algorithms that allows for better sample efficiency. Empirically, we show that our objective effectively optimizes the max@k metric in off-policy scenarios, aligning the model with the Best-of-N inference strategy.
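To make the metric concrete: max@k is the expected maximum reward over k generations, and, like pass@k, it admits an unbiased estimate from n ≥ k sampled rewards by averaging the maximum over all k-subsets. The sketch below is a minimal illustration of that estimator, not the paper's implementation; the function name and example rewards are assumptions, and with binary rewards the formula reduces to the familiar unbiased pass@k estimator.

```python
from math import comb

def max_at_k(rewards, k):
    """Unbiased estimate of E[max reward over k generations], computed from
    n >= k sampled rewards by averaging over all k-subsets.
    Illustrative helper, not the paper's code."""
    n = len(rewards)
    assert 1 <= k <= n, "need at least k sampled rewards"
    # Sort descending: the i-th reward (0-indexed) is the maximum of a random
    # k-subset iff the remaining k-1 elements come from the n-1-i smaller ones.
    sorted_r = sorted(rewards, reverse=True)
    total = comb(n, k)
    return sum(r * comb(n - 1 - i, k - 1) for i, r in enumerate(sorted_r)) / total

# Sanity check with binary rewards: 2 correct out of n=8, k=4
# reduces to pass@k = 1 - C(6,4)/C(8,4) ≈ 0.7857.
print(max_at_k([1, 0, 0, 1, 0, 0, 0, 0], k=4))
```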