Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning
September 3, 2025
Authors: Haozhe Wang, Qixin Xu, Che Liu, Junhong Wu, Fangzhen Lin, Wenhu Chen
cs.AI
Abstract
Reinforcement Learning (RL) has proven highly effective at enhancing the
complex reasoning abilities of Large Language Models (LLMs), yet the
underlying mechanisms driving this success remain largely opaque. Our
analysis reveals that puzzling phenomena such as "aha moments",
"length-scaling", and entropy dynamics are not disparate occurrences but
hallmarks of an emergent reasoning
hierarchy, akin to the separation of high-level strategic planning from
low-level procedural execution in human cognition. We uncover a compelling
two-phase dynamic: initially, a model is constrained by procedural correctness
and must improve its low-level skills. The learning bottleneck then decisively
shifts, with performance gains being driven by the exploration and mastery of
high-level strategic planning. This insight exposes a core inefficiency in
prevailing RL algorithms like GRPO, which apply optimization pressure
agnostically and dilute the learning signal across all tokens. To address this,
we propose HIerarchy-Aware Credit Assignment (HICRA), an algorithm that
concentrates optimization efforts on high-impact planning tokens. HICRA
significantly outperforms strong baselines, demonstrating that focusing on this
strategic bottleneck is key to unlocking advanced reasoning. Furthermore, we
validate semantic entropy as a superior compass for measuring strategic
exploration over misleading metrics such as token-level entropy.
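The abstract's core ideas can be illustrated with a minimal sketch: GRPO's group-relative advantage spreads one scalar learning signal uniformly across all tokens, whereas a HICRA-style scheme would amplify that signal on planning tokens; semantic entropy is then entropy over meaning-clusters of sampled responses rather than over the next-token distribution. The `alpha` amplification factor, the `hicra_token_weights` helper, and the assumption that planning tokens arrive as a precomputed mask are all illustrative, not details given in the abstract.

```python
import numpy as np

def grpo_advantages(rewards):
    # GRPO-style group-relative advantage: normalize each rollout's reward
    # by the mean and std of its sampling group. The resulting scalar is
    # applied uniformly to every token of that rollout.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def hicra_token_weights(planning_mask, alpha=2.0):
    # Hypothetical HICRA-style reweighting: amplify the per-token learning
    # signal on planning tokens (mask == 1) while leaving execution tokens
    # at baseline weight 1.0. `alpha` is an illustrative knob, not a value
    # from the paper.
    mask = np.asarray(planning_mask, dtype=float)
    return 1.0 + (alpha - 1.0) * mask

def semantic_entropy(cluster_counts):
    # Entropy over semantic clusters: sampled responses are grouped by
    # meaning, and entropy is computed over cluster frequencies, in
    # contrast to token-level entropy of the next-token distribution.
    p = np.asarray(cluster_counts, dtype=float)
    p = p / p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

# Per-token update signal for one rollout: group advantage times token weight.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])          # one advantage per rollout
weights = hicra_token_weights([1, 0, 0, 1, 0])        # planning-token mask
per_token_signal = adv[0] * weights                   # signal for rollout 0
```

Under this sketch, two response groups with identical token-level entropy can differ sharply in semantic entropy: many paraphrases of one strategy yield low semantic entropy, while genuinely distinct strategies yield high semantic entropy, which is the sense in which the paper argues it better tracks strategic exploration.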