
CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models

September 11, 2025
Authors: Runpeng Dai, Linfeng Song, Haolin Liu, Zhenwen Liang, Dian Yu, Haitao Mi, Zhaopeng Tu, Rui Liu, Tong Zheng, Hongtu Zhu, Dong Yu
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful paradigm for enhancing the reasoning ability of Large Language Models (LLMs). Yet current RLVR methods often explore poorly, leading to premature convergence and entropy collapse. To address this challenge, we introduce Curiosity-Driven Exploration (CDE), a framework that leverages the model's own intrinsic sense of curiosity to guide exploration. We formalize curiosity with signals from both the actor and the critic: for the actor, we use perplexity over its generated response, and for the critic, we use the variance of value estimates from a multi-head architecture. Both signals serve as an exploration bonus within the RLVR framework to guide the model. Our theoretical analysis shows that the actor-wise bonus inherently penalizes overconfident errors and promotes diversity among correct responses; moreover, we connect the critic-wise bonus to the well-established count-based exploration bonus in RL. Empirically, our method achieves an approximate +3 point improvement over standard RLVR using GRPO/PPO on AIME benchmarks. Further analysis identifies a calibration collapse mechanism within RLVR, shedding light on common LLM failure modes.
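
To make the two curiosity signals concrete, the sketch below shows one plausible way to turn them into exploration bonuses added to the verifiable reward. This is an illustrative reading of the abstract, not the authors' implementation: the function names, tensor shapes, and the coefficients alpha and beta are assumptions introduced here for clarity.

```python
import torch


def actor_curiosity_bonus(token_logprobs: torch.Tensor) -> torch.Tensor:
    """Actor-wise bonus from the perplexity of the generated response.

    token_logprobs: (seq_len,) log-probabilities the policy assigned to the
    tokens it actually generated. Higher perplexity (lower confidence in its
    own response) yields a larger bonus, discouraging overconfident errors.
    """
    # Perplexity = exp(-mean token log-probability).
    return torch.exp(-token_logprobs.mean())


def critic_curiosity_bonus(value_heads: torch.Tensor) -> torch.Tensor:
    """Critic-wise bonus from disagreement among multi-head value estimates.

    value_heads: (num_heads,) value predictions for the same state from a
    multi-head critic. High variance suggests a rarely visited state,
    playing a role analogous to a count-based exploration bonus.
    """
    return value_heads.var(unbiased=False)


def shaped_reward(verifiable_reward: torch.Tensor,
                  token_logprobs: torch.Tensor,
                  value_heads: torch.Tensor,
                  alpha: float = 0.1,
                  beta: float = 0.1) -> torch.Tensor:
    """Hypothetical combination of the verifiable reward with both bonuses;
    alpha and beta are placeholder weights, not values from the paper."""
    bonus = alpha * actor_curiosity_bonus(token_logprobs) \
            + beta * critic_curiosity_bonus(value_heads)
    return verifiable_reward + bonus
```

In this reading, the shaped reward would replace the plain verifiable reward inside a standard GRPO/PPO update; how the bonuses are scheduled or normalized during training is not specified by the abstract.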