CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful paradigm for enhancing the reasoning ability of Large Language Models (LLMs). Yet current RLVR methods often explore poorly, leading to premature convergence and entropy collapse. To address this challenge, we introduce Curiosity-Driven Exploration (CDE), a framework that leverages the model's own intrinsic sense of curiosity to guide exploration. We formalize curiosity with signals from both the actor and the critic: for the actor, we use the perplexity of its generated response, and for the critic, we use the variance of value estimates across the heads of a multi-head architecture. Both signals serve as exploration bonuses within the RLVR framework. Our theoretical analysis shows that the actor-wise bonus inherently penalizes overconfident errors and promotes diversity among correct responses; moreover, we connect the critic-wise bonus to the well-established count-based exploration bonus in RL. Empirically, our method improves on standard RLVR with GRPO/PPO by roughly 3 points on AIME benchmarks. Further analysis identifies a calibration collapse mechanism within RLVR, shedding light on common LLM failure modes.
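
The two curiosity signals named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the weights `alpha` and `beta`, and the simple additive combination are assumptions made for illustration only, since the abstract does not state how the bonuses enter the objective. The sketch assumes per-token log-probabilities of a sampled response and one value estimate per critic head are already available.

```python
import math
from statistics import pvariance


def perplexity_bonus(token_logprobs):
    """Actor signal: perplexity of a sampled response.

    Computed as exp of the mean negative log-probability per token.
    """
    if not token_logprobs:
        raise ValueError("response must contain at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def critic_variance_bonus(head_values):
    """Critic signal: population variance of value estimates across heads."""
    if not head_values:
        raise ValueError("need at least one value head")
    return pvariance(head_values)


def shaped_reward(verifiable_reward, token_logprobs, head_values,
                  alpha=0.1, beta=0.1):
    """Hypothetical combination: verifiable reward plus weighted bonuses.

    alpha and beta are illustrative weights, not values from the paper.
    """
    return (verifiable_reward
            + alpha * perplexity_bonus(token_logprobs)
            + beta * critic_variance_bonus(head_values))


if __name__ == "__main__":
    # Four tokens, each sampled with probability 0.5.
    logprobs = [math.log(0.5)] * 4
    # Two critic heads that disagree about the value of the state.
    heads = [0.0, 2.0]
    print(perplexity_bonus(logprobs))
    print(critic_variance_bonus(heads))
    print(shaped_reward(1.0, logprobs, heads, alpha=0.1, beta=0.5))
```

When all critic heads agree, the variance term vanishes, so under this sketch only responses on which the heads disagree receive a critic-side bonus.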
