DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning

Abstract

Reinforcement learning with verifiers (RLVR) is a central paradigm for improving large language model (LLM) reasoning, yet existing methods often suffer from limited exploration. Policies tend to collapse onto a few reasoning patterns and abandon deep exploration prematurely. Conventional entropy regularization introduces only local stochasticity and fails to induce meaningful path-level diversity, which leads to weak and unstable learning signals in group-based policy optimization. We propose DSDR, a Dual-Scale Diversity Regularization reinforcement learning framework that decomposes diversity in LLM reasoning into global and local components. Globally, DSDR promotes diversity among correct reasoning trajectories to explore distinct solution modes. Locally, it applies a length-invariant, token-level entropy regularization restricted to correct trajectories, preventing entropy collapse within each mode while preserving correctness. The two scales are coupled through a global-to-local allocation mechanism that emphasizes local regularization for more distinctive correct trajectories. We provide theoretical support showing that DSDR preserves optimal correctness under bounded regularization, sustains informative learning signals in group-based optimization, and yields a principled global-to-local coupling rule. Experiments on multiple reasoning benchmarks demonstrate consistent improvements in accuracy and pass@k, highlighting the importance of dual-scale diversity for deep exploration in RLVR. Code is available at https://github.com/SUSTechBruce/DSDR.
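To make the dual-scale idea concrete, the following is a minimal illustrative sketch, not the formulation used in the paper or its repository. Everything specific in it is an assumption made for illustration: the function names (`mean_token_entropy`, `jaccard_distance`, `dual_scale_bonus`), the use of Jaccard distance over token sets as the global distinctiveness score, the softmax with a `temperature` parameter as the global-to-local allocation, and the choice to average per-token entropy as the length-invariant local term. It only shows how a local entropy bonus could be restricted to correct trajectories and weighted toward the more distinctive ones.

```python
import math
from typing import List, Sequence


def mean_token_entropy(token_dists: Sequence[Sequence[float]]) -> float:
    """Average Shannon entropy per token (assumed length-invariant local term)."""
    if not token_dists:
        return 0.0
    total = 0.0
    for dist in token_dists:
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(token_dists)


def jaccard_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """1 - Jaccard similarity of token sets (a stand-in diversity measure)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def dual_scale_bonus(
    trajectories: Sequence[Sequence[str]],
    token_dists: Sequence[Sequence[Sequence[float]]],
    correct: Sequence[bool],
    temperature: float = 1.0,
) -> List[float]:
    """Hypothetical per-trajectory bonus; incorrect trajectories receive 0."""
    idx = [i for i, ok in enumerate(correct) if ok]
    bonus = [0.0] * len(trajectories)
    if not idx:
        return bonus

    # Global scale: how different each correct trajectory is from the
    # other correct trajectories in the same group.
    distinct = {}
    for i in idx:
        others = [j for j in idx if j != i]
        if others:
            dists = [jaccard_distance(trajectories[i], trajectories[j]) for j in others]
            distinct[i] = sum(dists) / len(others)
        else:
            distinct[i] = 0.0

    # Global-to-local allocation (assumed): softmax over distinctiveness,
    # so more distinctive correct trajectories get more local regularization.
    top = max(distinct.values())
    exps = {i: math.exp((distinct[i] - top) / temperature) for i in idx}
    norm = sum(exps.values())

    # Local scale: length-normalized token entropy, weighted by allocation.
    for i in idx:
        bonus[i] = (exps[i] / norm) * mean_token_entropy(token_dists[i])
    return bonus
```

In this sketch, two identical correct trajectories share a lower weight than a third correct trajectory that uses different tokens, and an incorrect trajectory gets no bonus at all. A real system would likely replace the token-set distance with a semantic or embedding-based measure and read the per-token distributions from the policy's logits.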
