DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning

February 23, 2026
Authors: Zhongwei Wan, Yun Shen, Zhihao Dou, Donghao Zhou, Yu Zhang, Xin Wang, Hui Shen, Jing Xiong, Chaofan Tao, Zixuan Zhong, Peizhou Huang, Mi Zhang
cs.AI

Abstract

Reinforcement learning with verifiers (RLVR) is a central paradigm for improving large language model (LLM) reasoning, yet existing methods often suffer from limited exploration. Policies tend to collapse onto a few reasoning patterns and prematurely stop deep exploration, while conventional entropy regularization introduces only local stochasticity and fails to induce meaningful path-level diversity, leading to weak and unstable learning signals in group-based policy optimization. We propose DSDR, a Dual-Scale Diversity Regularization reinforcement learning framework that decomposes diversity in LLM reasoning into global and local components. Globally, DSDR promotes diversity among correct reasoning trajectories to explore distinct solution modes. Locally, it applies a length-invariant, token-level entropy regularization restricted to correct trajectories, preventing entropy collapse within each mode while preserving correctness. The two scales are coupled through a global-to-local allocation mechanism that emphasizes local regularization for more distinctive correct trajectories. We provide theoretical support showing that DSDR preserves optimal correctness under bounded regularization, sustains informative learning signals in group-based optimization, and yields a principled global-to-local coupling rule. Experiments on multiple reasoning benchmarks demonstrate consistent improvements in accuracy and pass@k, highlighting the importance of dual-scale diversity for deep exploration in RLVR. Code is available at https://github.com/SUSTechBruce/DSDR.
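
The abstract does not give formulas, so the following is only a minimal sketch of how a dual-scale bonus of this general shape could be computed for one group of rollouts. It is not the authors' implementation. Everything specific here is an assumption made for illustration: the use of trajectory embeddings and cosine distance as the global diversity measure, the softmax allocation with temperature `tau`, the coefficients `lam_global` and `lam_local`, and all function names. The only properties taken from the abstract are that the global term compares correct trajectories with each other, that the local term is a token-level entropy averaged over length (so it does not grow with trajectory length) and applied only to correct trajectories, and that more distinctive correct trajectories receive more of the local regularization.

```python
import numpy as np


def global_diversity(embeddings, correct):
    """Mean cosine distance from each correct trajectory to the other correct ones.

    Assumption: each trajectory has a nonzero embedding vector.
    Incorrect trajectories get a score of 0.
    """
    emb = np.asarray(embeddings, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    scores = np.zeros(len(emb))
    idx = np.flatnonzero(correct)
    if len(idx) < 2:
        return scores  # pairwise diversity is undefined for fewer than two
    unit = emb[idx] / np.linalg.norm(emb[idx], axis=1, keepdims=True)
    dist = 1.0 - unit @ unit.T  # pairwise cosine distance, diagonal is ~0
    scores[idx] = dist.sum(axis=1) / (len(idx) - 1)
    return scores


def allocation_weights(scores, correct, tau=1.0):
    """Softmax over correct trajectories: more distinctive ones get more weight."""
    correct = np.asarray(correct, dtype=bool)
    weights = np.zeros(len(scores))
    if not correct.any():
        return weights
    logits = np.asarray(scores, dtype=float)[correct] / tau
    logits = logits - logits.max()  # numerical stability
    exp = np.exp(logits)
    weights[correct] = exp / exp.sum()
    return weights


def local_entropy(token_entropies, correct):
    """Per-trajectory mean token entropy (length-normalized), 0 if incorrect."""
    return np.array([
        float(np.mean(h)) if ok and len(h) > 0 else 0.0
        for h, ok in zip(token_entropies, correct)
    ])


def dual_scale_bonus(embeddings, token_entropies, correct,
                     lam_global=0.1, lam_local=0.01, tau=1.0):
    """Hypothetical per-trajectory bonus combining both scales."""
    g = global_diversity(embeddings, correct)
    w = allocation_weights(g, correct, tau)
    h = local_entropy(token_entropies, correct)
    return lam_global * g + lam_local * w * h


# Usage: four rollouts for one prompt, the last one judged incorrect.
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
token_entropies = [np.full(5, 0.4), np.full(50, 0.4), np.full(8, 0.4), np.full(8, 0.4)]
correct = [True, True, True, False]
bonus = dual_scale_bonus(embeddings, token_entropies, correct)
print(bonus)
```

In this toy group the third rollout differs most from the other correct ones, so it receives both the largest global score and the largest share of the local entropy term, while the incorrect rollout receives no bonus at all. Keeping the coefficients small is meant to mirror the abstract's "bounded regularization" condition, although the actual bound is not stated there.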