
Reasoning with Exploration: An Entropy Perspective

June 17, 2025
作者: Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Wayne Xin Zhao, Zhenliang Zhang, Furu Wei
cs.AI

Abstract

Balancing exploration and exploitation is a central goal in reinforcement learning (RL). Despite recent advances in enhancing language model (LM) reasoning, most methods lean toward exploitation, and increasingly encounter performance plateaus. In this work, we revisit entropy -- a signal of exploration in RL -- and examine its relationship to exploratory reasoning in LMs. Through empirical analysis, we uncover strong positive correlations between high-entropy regions and three types of exploratory reasoning actions: (1) pivotal tokens that determine or connect logical steps, (2) reflective actions such as self-verification and correction, and (3) rare behaviors under-explored by the base LMs. Motivated by this, we introduce a minimal modification to standard RL with only one line of code: augmenting the advantage function with an entropy-based term. Unlike traditional maximum-entropy methods which encourage exploration by promoting uncertainty, we encourage exploration by promoting longer and deeper reasoning chains. Notably, our method achieves significant gains on the Pass@K metric -- an upper-bound estimator of LM reasoning capabilities -- even when evaluated with extremely large K values, pushing the boundaries of LM reasoning.
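The abstract describes the method as a one-line modification: adding an entropy-based term to the advantage function in standard RL. Below is a minimal sketch of what such entropy-shaped advantage estimation might look like; the function names, the coefficient `alpha`, and the per-token treatment are illustrative assumptions, not the authors' implementation.

```python
import math

def token_entropy(probs):
    """Shannon entropy of a single token's output distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def shaped_advantage(advantage, probs, alpha=0.1):
    """Hypothetical entropy-shaped advantage.

    Standard policy-gradient advantage plus an entropy bonus,
    so high-entropy (exploratory) tokens receive extra credit.
    `alpha` is an assumed shaping coefficient.
    """
    return advantage + alpha * token_entropy(probs)

# A deterministic token (entropy 0) keeps its original advantage;
# an uncertain token gets a bonus proportional to its entropy.
certain = shaped_advantage(1.0, [1.0], alpha=0.1)
uncertain = shaped_advantage(1.0, [0.5, 0.5], alpha=0.1)
```

The key design point the abstract highlights is that, unlike classic maximum-entropy RL (which adds an entropy term to the *objective* to inject uncertainty), the bonus here enters the *advantage*, steering credit assignment toward exploratory reasoning tokens.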
June 18, 2025