Reasoning with Exploration: An Entropy Perspective

Abstract

Balancing exploration and exploitation is a central goal in reinforcement learning (RL). Despite recent advances in enhancing language model (LM) reasoning, most methods lean toward exploitation and increasingly encounter performance plateaus. In this work, we revisit entropy, a signal of exploration in RL, and examine its relationship to exploratory reasoning in LMs. Through empirical analysis, we uncover strong positive correlations between high-entropy regions and three types of exploratory reasoning actions: (1) pivotal tokens that determine or connect logical steps, (2) reflective actions such as self-verification and correction, and (3) rare behaviors under-explored by the base LMs. Motivated by this, we introduce a minimal modification to standard RL that requires only one line of code: augmenting the advantage function with an entropy-based term. Unlike traditional maximum-entropy methods, which encourage exploration by promoting uncertainty, we encourage exploration by promoting longer and deeper reasoning chains. Notably, our method achieves significant gains on the Pass@K metric, an upper-bound estimator of LM reasoning capabilities, even when evaluated with extremely large K values, pushing the boundaries of LM reasoning.
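The abstract does not state the exact form of the entropy-based term, so the following is only a minimal sketch of what augmenting the advantage function could look like. It assumes the term is the per-token policy entropy scaled by a coefficient and clipped to a fraction of the advantage magnitude. The function names and the hyperparameters `alpha` and `kappa` are hypothetical and are not taken from the abstract.

```python
import math
from typing import List, Sequence


def token_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of the softmax distribution for one token."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_shaped_advantages(
    advantages: Sequence[float],
    entropies: Sequence[float],
    alpha: float = 0.4,  # hypothetical scale of the entropy bonus
    kappa: float = 2.0,  # hypothetical clip: bonus <= |advantage| / kappa
) -> List[float]:
    """Add a clipped entropy bonus to each per-token advantage.

    Assumption: the entropies are plain numbers here. In an autograd
    framework they would need to be detached so that the bonus only
    reshapes the advantage and is not itself optimized.
    """
    shaped = []
    for adv, ent in zip(advantages, entropies):
        bonus = min(alpha * ent, abs(adv) / kappa)
        shaped.append(adv + bonus)  # the "one line" modification
    return shaped


if __name__ == "__main__":
    ents = [token_entropy([2.0, 0.0, 0.0]), token_entropy([0.0, 0.0, 0.0])]
    print(entropy_shaped_advantages([1.0, -1.0], ents))
```

In this sketch, choosing `kappa` greater than 1 keeps the bonus smaller than the magnitude of the original advantage, so the sign of each advantage is unchanged and tokens with zero advantage receive no bonus. This differs from a maximum-entropy objective, which adds an entropy term to the loss itself.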
