DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning
February 23, 2026
Authors: Zhongwei Wan, Yun Shen, Zhihao Dou, Donghao Zhou, Yu Zhang, Xin Wang, Hui Shen, Jing Xiong, Chaofan Tao, Zixuan Zhong, Peizhou Huang, Mi Zhang
cs.AI
Abstract
Reinforcement learning with verifiers (RLVR) is a central paradigm for improving large language model (LLM) reasoning, yet existing methods often suffer from limited exploration. Policies tend to collapse onto a few reasoning patterns and prematurely stop deep exploration, while conventional entropy regularization introduces only local stochasticity and fails to induce meaningful path-level diversity, leading to weak and unstable learning signals in group-based policy optimization. We propose DSDR, a Dual-Scale Diversity Regularization reinforcement learning framework that decomposes diversity in LLM reasoning into global and coupling components. Globally, DSDR promotes diversity among correct reasoning trajectories to explore distinct solution modes. Locally, it applies a length-invariant, token-level entropy regularization restricted to correct trajectories, preventing entropy collapse within each mode while preserving correctness. The two scales are coupled through a global-to-local allocation mechanism that emphasizes local regularization for more distinctive correct trajectories. We provide theoretical support showing that DSDR preserves optimal correctness under bounded regularization, sustains informative learning signals in group-based optimization, and yields a principled global-to-local coupling rule. Experiments on multiple reasoning benchmarks demonstrate consistent improvements in accuracy and pass@k, highlighting the importance of dual-scale diversity for deep exploration in RLVR. Code is available at https://github.com/SUSTechBruce/DSDR.
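The abstract describes three ingredients: a global diversity term over correct trajectories, a length-invariant token-level entropy term restricted to correct trajectories, and a global-to-local allocation that concentrates local regularization on more distinctive trajectories. A minimal, hypothetical sketch of how a dual-scale bonus with that shape could be computed is below; the paper's actual objective, trajectory distance, and coefficients are not given in the abstract, so `dsdr_bonus`, the Jaccard-style distance, and the coefficients `lam_g`, `lam_l` are illustrative assumptions, not the authors' implementation.

```python
import math

def token_entropy(probs):
    """Shannon entropy of one token's next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def length_invariant_entropy(step_probs):
    """Mean per-token entropy, so longer trajectories gain no extra bonus."""
    return sum(token_entropy(p) for p in step_probs) / len(step_probs)

def trajectory_distance(a, b):
    """Toy path-level distance: 1 - Jaccard overlap of token sets (an
    assumption; any trajectory dissimilarity measure could stand in here)."""
    sa, sb = set(a), set(b)
    return 1.0 - len(sa & sb) / len(sa | sb)

def dsdr_bonus(correct_trajs, step_probs_per_traj, lam_g=0.1, lam_l=0.01):
    """Hypothetical dual-scale bonus.
    Global term: mean pairwise distance among correct trajectories.
    Local term: length-invariant entropy of each correct trajectory, weighted
    by that trajectory's distinctiveness (its mean distance to the other
    correct trajectories), i.e. a simple global-to-local allocation.
    """
    n = len(correct_trajs)
    if n == 0:
        return 0.0
    dists = [[trajectory_distance(correct_trajs[i], correct_trajs[j])
              for j in range(n)] for i in range(n)]
    # global diversity among correct trajectories
    global_div = (sum(dists[i][j] for i in range(n) for j in range(n) if i != j)
                  / max(n * (n - 1), 1))
    # distinctiveness weights route more local regularization to
    # more distinctive correct trajectories
    weights = [sum(dists[i]) / max(n - 1, 1) for i in range(n)]
    local = sum(w * length_invariant_entropy(sp)
                for w, sp in zip(weights, step_probs_per_traj))
    return lam_g * global_div + lam_l * local
```

In a group-based optimizer such as GRPO-style training, a bonus of this form would be added to the verifier reward of correct samples only, consistent with the abstract's claim that regularization is restricted to correct trajectories to preserve correctness.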