Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning
September 3, 2025
Authors: Haozhe Wang, Qixin Xu, Che Liu, Junhong Wu, Fangzhen Lin, Wenhu Chen
cs.AI
Abstract
Reinforcement Learning (RL) has proven highly effective at enhancing the
complex reasoning abilities of Large Language Models (LLMs), yet the
underlying mechanisms driving this success remain largely opaque. Our
analysis reveals that puzzling phenomena such as "aha moments",
"length-scaling", and entropy
dynamics are not disparate occurrences but hallmarks of an emergent reasoning
hierarchy, akin to the separation of high-level strategic planning from
low-level procedural execution in human cognition. We uncover a compelling
two-phase dynamic: initially, a model is constrained by procedural correctness
and must improve its low-level skills. The learning bottleneck then decisively
shifts, with performance gains being driven by the exploration and mastery of
high-level strategic planning. This insight exposes a core inefficiency in
prevailing RL algorithms like GRPO, which apply optimization pressure
agnostically and dilute the learning signal across all tokens. To address this,
we propose HIerarchy-Aware Credit Assignment (HICRA), an algorithm that
concentrates optimization efforts on high-impact planning tokens. HICRA
significantly outperforms strong baselines, demonstrating that focusing on this
strategic bottleneck is key to unlocking advanced reasoning. Furthermore, we
validate semantic entropy as a superior compass for measuring strategic
exploration, in contrast to misleading metrics such as token-level entropy.
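
To make the credit-assignment contrast concrete, below is a minimal sketch
(in PyTorch) of how a hierarchy-aware update might concentrate the learning
signal on planning tokens, versus a GRPO-style update that spreads it
uniformly across the sequence. The planning_mask input, the alpha shaping
scheme, and the function name are illustrative assumptions for this sketch,
not the paper's exact formulation of HICRA.

import torch

def hicra_policy_loss(logprobs, advantages, planning_mask, alpha=1.0):
    # logprobs:      (batch, seq) token log-probs under the current policy
    # advantages:    (batch, seq) per-token advantages (e.g., group-normalized
    #                returns as in GRPO, broadcast over the sequence)
    # planning_mask: (batch, seq) 1.0 for tokens tagged as high-level planning
    #                tokens, 0.0 for low-level procedural tokens (how tokens
    #                are tagged is an assumption external to this sketch)
    # alpha:         strength of credit concentration; alpha=0 recovers the
    #                uniform, hierarchy-agnostic update
    shaped_adv = advantages * (1.0 + alpha * planning_mask)
    # REINFORCE-style surrogate: maximize advantage-weighted log-probability,
    # with the signal amplified on planning tokens rather than diluted.
    return -(shaped_adv.detach() * logprobs).mean()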
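
Semantic entropy, as referenced above, measures uncertainty over meanings
rather than over surface tokens. A minimal sketch, assuming sampled rollouts
can be bucketed into semantic equivalence classes (here via a placeholder
semantic_label function, e.g., an extracted final answer or an abstracted
plan label):

import math
from collections import Counter

def semantic_entropy(samples, semantic_label):
    # Group sampled responses by meaning, not by exact token sequence.
    counts = Counter(semantic_label(s) for s in samples)
    total = sum(counts.values())
    # Shannon entropy over semantic clusters: high only when the sampled
    # strategies genuinely differ, unlike token-level entropy, which can be
    # inflated by mere surface rephrasings of a single strategy.
    return -sum((c / total) * math.log(c / total) for c in counts.values())

Rollouts that rephrase one strategy in many ways keep this quantity low even
when token-level entropy is high, which is the distinction the abstract draws.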