CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models

September 11, 2025
作者: Runpeng Dai, Linfeng Song, Haolin Liu, Zhenwen Liang, Dian Yu, Haitao Mi, Zhaopeng Tu, Rui Liu, Tong Zheng, Hongtu Zhu, Dong Yu
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful paradigm for enhancing the reasoning ability of Large Language Models (LLMs). Yet current RLVR methods often explore poorly, leading to premature convergence and entropy collapse. To address this challenge, we introduce Curiosity-Driven Exploration (CDE), a framework that leverages the model's own intrinsic sense of curiosity to guide exploration. We formalize curiosity with signals from both the actor and the critic: for the actor, we use perplexity over its generated response, and for the critic, we use the variance of value estimates from a multi-head architecture. Both signals serve as an exploration bonus within the RLVR framework to guide the model. Our theoretical analysis shows that the actor-wise bonus inherently penalizes overconfident errors and promotes diversity among correct responses; moreover, we connect the critic-wise bonus to the well-established count-based exploration bonus in RL. Empirically, our method achieves an approximate +3 point improvement over standard RLVR using GRPO/PPO on AIME benchmarks. Further analysis identifies a calibration collapse mechanism within RLVR, shedding light on common LLM failure modes.
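For intuition, below is a minimal sketch of how the two curiosity signals described in the abstract could be turned into exploration bonuses and added to the verifiable reward. This is not the authors' implementation: the function names, tensor shapes, and the weighting coefficients alpha/beta are illustrative assumptions.

```python
import torch

def actor_curiosity_bonus(token_logprobs: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Perplexity of the actor over its own generated response (actor-wise signal).

    token_logprobs: (batch, seq_len) log-probabilities of the sampled tokens.
    mask:           (batch, seq_len) 1 for response tokens, 0 for padding.
    Returns one scalar bonus per sequence; higher perplexity = less confident = more exploratory.
    """
    lengths = mask.sum(dim=-1).clamp(min=1)
    mean_nll = -(token_logprobs * mask).sum(dim=-1) / lengths  # average negative log-likelihood
    return torch.exp(mean_nll)

def critic_curiosity_bonus(head_values: torch.Tensor) -> torch.Tensor:
    """Variance of value estimates across critic heads (critic-wise signal).

    head_values: (num_heads, batch) value estimates from a multi-head critic.
    Disagreement among heads serves as an epistemic-uncertainty proxy.
    """
    return head_values.var(dim=0, unbiased=False)

def shaped_reward(verifiable_reward, ppl_bonus, var_bonus, alpha=0.1, beta=0.1):
    """Add both curiosity bonuses to the verifiable (e.g., correctness) reward.

    alpha and beta are hypothetical bonus weights, not values from the paper.
    """
    return verifiable_reward + alpha * ppl_bonus + beta * var_bonus
```

Read against the abstract, the perplexity term favors responses the actor itself is uncertain about (penalizing overconfident errors and encouraging diverse correct answers), while the head-disagreement term plays a role analogous to a count-based exploration bonus.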