The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward
September 9, 2025
Authors: Long Li, Jiaran Hao, Jason Klein Liu, Zhijian Zhou, Xiaoyu Tan, Wei Chu, Zhe Wang, Shirui Pan, Chao Qu, Yuan Qi
cs.AI
Abstract
A central paradox in fine-tuning Large Language Models (LLMs) with Reinforcement Learning with Verifiable Reward (RLVR) is the frequent degradation of multi-attempt performance (Pass@k) despite improvements in single-attempt accuracy (Pass@1). This is often accompanied by catastrophic forgetting, where models lose previously acquired skills. While various methods have been proposed, the choice and function of the divergence term have been surprisingly unexamined as a proactive solution. We argue that standard RLVR objectives -- both those using the mode-seeking reverse KL divergence and those forgoing a divergence term entirely -- lack a crucial mechanism for knowledge retention. The reverse KL actively accelerates this decay by narrowing the policy, while its absence provides no safeguard against the model drifting from its diverse knowledge base. We propose a fundamental shift in perspective: using the divergence term itself as the solution. Our framework, Diversity-Preserving Hybrid RL (DPH-RL), leverages mass-covering f-divergences (such as the forward KL and JS divergence) to function as a rehearsal mechanism. By continuously referencing the initial policy, this approach forces the model to maintain broad solution coverage. Extensive experiments on math and SQL generation demonstrate that DPH-RL not only resolves the Pass@k degradation but also improves both Pass@1 and Pass@k, in- and out-of-domain. Additionally, DPH-RL is more training-efficient because it computes f-divergences via their generator functions, requiring only sampling from the initial policy and no online reference model. Our work highlights a crucial, overlooked axis for improving RLVR, demonstrating that the proper selection of a divergence measure is a powerful tool for building more general and diverse reasoning models.
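To make the generator-function idea concrete, below is a minimal PyTorch sketch, not the authors' implementation, of how a mass-covering divergence penalty (forward KL or JS) can be estimated purely from log-probabilities of samples drawn once from the frozen initial policy, so no online reference model is needed. The function name f_divergence_penalty, the tensor arguments, and the beta weighting in the usage comment are illustrative assumptions, not names from the paper.

```python
import torch

def f_divergence_penalty(logp_theta: torch.Tensor,
                         logp_ref: torch.Tensor,
                         kind: str = "forward_kl") -> torch.Tensor:
    """Monte-Carlo estimate of a mass-covering f-divergence between the current
    policy pi_theta and the frozen initial policy pi_ref, computed from samples
    y ~ pi_ref only (their log-probs under pi_ref can be cached offline).

    logp_theta: log pi_theta(y | x) for the reference samples, shape [N]
    logp_ref:   log pi_ref(y | x) cached when the samples were drawn, shape [N]
    """
    log_ratio = logp_theta - logp_ref          # log t, with t = pi_theta / pi_ref
    t = log_ratio.exp()

    if kind == "forward_kl":
        # Generator f(t) = -log t, so E_{pi_ref}[-log(pi_theta/pi_ref)]
        # = KL(pi_ref || pi_theta): the mass-covering (forward) KL.
        vals = -log_ratio
    elif kind == "js":
        # Generator f(t) = (t/2) log t - ((1+t)/2) log((1+t)/2),
        # giving the Jensen-Shannon divergence JS(pi_theta, pi_ref).
        vals = 0.5 * t * log_ratio - 0.5 * (1.0 + t) * torch.log(0.5 * (1.0 + t))
    else:
        raise ValueError(f"unknown divergence: {kind}")

    return vals.mean()


# Usage sketch (hypothetical): add the penalty to a standard RLVR policy loss,
# where `policy_loss` is the verifiable-reward objective and `beta` its weight.
# loss = policy_loss + beta * f_divergence_penalty(logp_theta, logp_ref, "js")
```

Note that the forward-KL branch reduces to maximizing the current policy's log-likelihood on initial-policy samples, which is what gives the penalty its rehearsal-like, coverage-preserving effect described in the abstract.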