Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization

October 15, 2025
Authors: Yang Li, Zhichen Dong, Yuhan Sun, Weixun Wang, Shaopan Xiong, Yijia Luo, Jiashun Liu, Han Lu, Jiamang Wang, Wenbo Su, Bo Zheng, Junchi Yan
cs.AI

Abstract

The reasoning pattern of large language models (LLMs) remains opaque, and reinforcement learning (RL) typically applies uniform credit across an entire generation, blurring the distinction between pivotal and routine steps. This work positions attention as a privileged substrate that renders the internal logic of LLMs legible, not merely as a byproduct of computation, but as a mechanistic blueprint of reasoning itself. We first distinguish between locally and globally focused attention heads and reveal that locally focused heads produce a sawtooth pattern near the diagonal, indicating phrasal chunks, while globally focused heads expose tokens that exert broad downstream influence over future tokens. We formalize these observations with two metrics: 1) Windowed Average Attention Distance, which measures the extent of backward attention within a clipped window; and 2) Future Attention Influence, which quantifies a token's global importance as the average attention it receives from subsequent tokens. Taken together, these signals reveal a recurring preplan-and-anchor mechanism, in which the model first performs a long-range contextual reference to generate an introductory token, which is immediately followed by or coincides with a semantic anchor token that organizes subsequent reasoning. Leveraging these insights, we introduce three novel RL strategies that dynamically perform targeted credit assignment to critical nodes (preplan tokens, anchor tokens, and their temporal coupling) and show consistent performance gains across various reasoning tasks. By aligning optimization with the model's intrinsic reasoning rhythm, we aim to transform opaque optimization into an actionable, structure-aware process, offering a potential step toward more transparent and effective optimization of LLM reasoning.
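
As a rough illustration of the two metrics named in the abstract, the sketch below computes per-token scores from a single attention head's causal attention matrix. This is a minimal reconstruction from the textual definitions, not the authors' implementation: the function names, the default window size, and the renormalization of attention within the clipped window are assumptions made for illustration.

```python
import numpy as np

def windowed_avg_attention_distance(attn: np.ndarray, window: int = 32) -> np.ndarray:
    """Per-query average backward attention distance within a clipped window.

    `attn` is a (T, T) causal attention matrix for one head (rows = queries).
    For each query token t, attention is restricted to the last `window` keys,
    renormalized, and used to weight the key-to-query distances.
    """
    T = attn.shape[0]
    scores = np.zeros(T)
    for t in range(T):
        lo = max(0, t - window)
        weights = attn[t, lo:t + 1]            # attention inside the clipped window
        total = weights.sum()
        if total <= 0:
            continue
        weights = weights / total              # renormalize within the window
        distances = np.arange(t - lo, -1, -1)  # distance of each key from query t
        scores[t] = float((weights * distances).sum())
    return scores

def future_attention_influence(attn: np.ndarray) -> np.ndarray:
    """Per-key average attention received from all subsequent query tokens.

    A token with a high score is attended to broadly by later tokens, i.e. it
    exerts broad downstream influence (a candidate "anchor" token).
    """
    T = attn.shape[0]
    scores = np.zeros(T)
    for t in range(T - 1):
        scores[t] = float(attn[t + 1:, t].mean())
    return scores

# Toy usage with a random causal attention matrix.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T = 16
    raw = rng.random((T, T))
    causal = np.tril(np.ones((T, T), dtype=bool))
    attn = np.where(causal, raw, 0.0)
    attn = attn / attn.sum(axis=1, keepdims=True)

    wad = windowed_avg_attention_distance(attn, window=8)
    fai = future_attention_influence(attn)
    # Heuristically, a high `wad` marks long-range "preplan" references and a
    # high `fai` marks "anchor" tokens that organize subsequent reasoning.
    print(wad.round(2))
    print(fai.round(2))
```

In the RL strategies described in the abstract, scores of this kind would presumably be used to concentrate credit on preplan and anchor tokens, but the exact weighting scheme is not specified there.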