Reasoning with Exploration: An Entropy Perspective
June 17, 2025
Authors: Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Wayne Xin Zhao, Zhenliang Zhang, Furu Wei
cs.AI
Abstract
Balancing exploration and exploitation is a central goal in reinforcement
learning (RL). Despite recent advances in enhancing language model (LM)
reasoning, most methods lean toward exploitation, and increasingly encounter
performance plateaus. In this work, we revisit entropy -- a signal of
exploration in RL -- and examine its relationship to exploratory reasoning in
LMs. Through empirical analysis, we uncover strong positive correlations
between high-entropy regions and three types of exploratory reasoning actions:
(1) pivotal tokens that determine or connect logical steps, (2) reflective
actions such as self-verification and correction, and (3) rare behaviors
under-explored by the base LMs. Motivated by this, we introduce a minimal
modification to standard RL with only one line of code: augmenting the
advantage function with an entropy-based term. Unlike traditional
maximum-entropy methods which encourage exploration by promoting uncertainty,
we encourage exploration by promoting longer and deeper reasoning chains.
Notably, our method achieves significant gains on the Pass@K metric -- an
upper-bound estimator of LM reasoning capabilities -- even when evaluated with
extremely large K values, pushing the boundaries of LM reasoning.
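
For concreteness, here is a minimal sketch of the kind of one-line advantage shaping the abstract describes, written for a PPO/GRPO-style trainer where per-token advantages and per-token policy entropies are already computed. The coefficient `alpha`, the detaching of the entropy, and the clipping of the bonus against the advantage magnitude are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def shape_advantages(advantages: torch.Tensor,
                     entropy: torch.Tensor,
                     alpha: float = 0.1) -> torch.Tensor:
    """Augment per-token advantages with an entropy-based bonus.

    advantages: per-token advantage estimates, shape (batch, seq_len).
    entropy:    per-token policy entropy H(pi(.|s_t)), same shape.
    alpha and the clipping of the bonus to |advantage| are illustrative
    choices, not necessarily the paper's exact formulation.
    """
    # Detach the entropy so the bonus reshapes the advantage signal
    # rather than adding an explicit entropy-maximization gradient,
    # distinguishing this from traditional maximum-entropy RL.
    bonus = alpha * entropy.detach()
    # Cap the bonus at the advantage magnitude so the exploration
    # signal cannot overwhelm the task reward.
    bonus = torch.minimum(bonus, advantages.abs())
    # The "one line": entropy-shaped advantages.
    return advantages + bonus
```

In use, a trainer would simply replace `advantages` with `shape_advantages(advantages, entropy)` before computing the policy-gradient loss, which is why the abstract can describe the change as a single line of code.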