The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward
September 9, 2025
Authors: Long Li, Jiaran Hao, Jason Klein Liu, Zhijian Zhou, Xiaoyu Tan, Wei Chu, Zhe Wang, Shirui Pan, Chao Qu, Yuan Qi
cs.AI
Abstract
A central paradox in fine-tuning Large Language Models (LLMs) with
Reinforcement Learning with Verifiable Reward (RLVR) is the frequent
degradation of multi-attempt performance (Pass@k) despite improvements in
single-attempt accuracy (Pass@1). This is often accompanied by catastrophic
forgetting, where models lose previously acquired skills. While various methods
have been proposed, the choice and function of the divergence term have been
surprisingly unexamined as a proactive solution. We argue that standard RLVR
objectives -- both those using the mode-seeking reverse KL-divergence and those
forgoing a divergence term entirely -- lack a crucial mechanism for knowledge
retention. The reverse-KL actively accelerates this decay by narrowing the
policy, while its absence provides no safeguard against the model drifting from
its diverse knowledge base. We propose a fundamental shift in perspective:
using the divergence term itself as the solution. Our framework,
Diversity-Preserving Hybrid RL (DPH-RL), leverages mass-covering f-divergences
(like forward-KL and JS-divergence) to function as a rehearsal mechanism. By
continuously referencing the initial policy, this approach forces the model to
maintain broad solution coverage. Extensive experiments on math and SQL
generation demonstrate that DPH-RL not only resolves the Pass@k degradation but
also improves both Pass@1 and Pass@k in- and out-of-domain. Additionally, DPH-RL is
more training-efficient because it computes f-divergence using generator
functions, requiring only sampling from the initial policy and no online
reference model. Our work highlights a crucial, overlooked axis for improving
RLVR, demonstrating that the proper selection of a divergence measure is a
powerful tool for building more general and diverse reasoning models.
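To make the efficiency claim concrete: a mass-covering f-divergence can be estimated via its generator function from responses sampled once from the initial policy, so no online reference model is needed during training. The sketch below is a minimal illustration of that idea, not the authors' implementation; the function names (`logp_init`, `logp_current`, `f_divergence_penalty`) and the specific estimator are assumptions for exposition.

```python
import math

# Illustrative sketch: estimate a mass-covering f-divergence penalty from
# responses y sampled from the *initial* policy pi_0 (e.g. cached offline),
# using only log-probabilities of those responses under pi_0 and under the
# current policy pi_theta. No online reference model is required.

def forward_kl_generator(r: float) -> float:
    """Generator f(r) = -log r; E_{y ~ pi_0}[f(r)] equals KL(pi_0 || pi_theta)."""
    return -math.log(r)

def js_generator(r: float) -> float:
    """Generator whose expectation under pi_0 gives the Jensen-Shannon divergence."""
    return 0.5 * (r * math.log(r) - (r + 1.0) * math.log((r + 1.0) / 2.0))

def f_divergence_penalty(logp_init, logp_current, generator=forward_kl_generator):
    """Monte-Carlo estimate of the f-divergence from pi_0-sampled responses.

    logp_init / logp_current: log pi_0(y|x) and log pi_theta(y|x) for the same
    responses y, which were drawn from the initial policy pi_0.
    """
    values = []
    for lp0, lpt in zip(logp_init, logp_current):
        ratio = math.exp(lpt - lp0)      # r(y) = pi_theta(y|x) / pi_0(y|x)
        values.append(generator(ratio))  # f(r(y))
    return sum(values) / len(values)     # empirical expectation over pi_0 samples
```

Because the expectation is taken over the fixed initial policy, the penalty continuously pulls the current policy back toward the full support of its starting distribution, which is the rehearsal effect the abstract describes; a reverse-KL penalty, by contrast, is estimated from samples of the current policy and therefore does not penalize dropping modes that the policy has already stopped generating.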