

Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization

October 15, 2025
Authors: Yang Li, Zhichen Dong, Yuhan Sun, Weixun Wang, Shaopan Xiong, Yijia Luo, Jiashun Liu, Han Lu, Jiamang Wang, Wenbo Su, Bo Zheng, Junchi Yan
cs.AI

Abstract

The reasoning pattern of large language models (LLMs) remains opaque, and reinforcement learning (RL) typically applies uniform credit across an entire generation, blurring the distinction between pivotal and routine steps. This work positions attention as a privileged substrate that renders the internal logic of LLMs legible: not merely a byproduct of computation, but a mechanistic blueprint of reasoning itself. We first distinguish between attention heads engaged in locally and globally focused information processing, revealing that locally focused heads produce a sawtooth pattern near the diagonal that indicates phrasal chunks, while globally focused heads expose tokens that exert broad downstream influence over future tokens. We formalize these observations with two metrics: 1) Windowed Average Attention Distance, which measures the extent of backward attention within a clipped window; 2) Future Attention Influence, which quantifies a token's global importance as the average attention it receives from subsequent tokens. Taken together, these signals reveal a recurring preplan-and-anchor mechanism, in which the model first performs a long-range contextual reference to generate an introductory token, immediately followed by (or coinciding with) a semantic anchor token that organizes subsequent reasoning. Leveraging these insights, we introduce three novel RL strategies that dynamically perform targeted credit assignment to critical nodes (preplan tokens, anchor tokens, and their temporal coupling) and show consistent performance gains across various reasoning tasks. By aligning optimization with the model's intrinsic reasoning rhythm, we aim to transform opaque optimization into an actionable, structure-aware process, offering a potential step toward more transparent and effective optimization of LLM reasoning.
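
The two metrics are defined only in words above; the sketch below shows one plausible way to compute them from a single head's causal attention matrix. The window size, the renormalization of attention mass inside the clipped window, and the averaging over strictly later queries are assumptions for illustration, not the paper's exact formulation.

```python
# Minimal sketch of the two attention metrics described in the abstract,
# computed from one head's row-stochastic causal attention matrix attn[t, j]
# (query t attends to key j, with j <= t). Normalization choices are assumed.
import numpy as np

def windowed_avg_attention_distance(attn: np.ndarray, window: int = 32) -> np.ndarray:
    """Per query token t, the average backward distance t - j, with attention
    restricted (and renormalized) to the last `window` key positions.
    Low values suggest local, phrase-level heads; high values suggest
    long-range contextual reference."""
    T = attn.shape[0]
    out = np.zeros(T)
    for t in range(T):
        lo = max(0, t - window + 1)
        w = attn[t, lo : t + 1]              # attention inside the clipped window
        mass = w.sum()
        if mass > 0:
            dist = t - np.arange(lo, t + 1)  # distance of each key from the query
            out[t] = (w * dist).sum() / mass
    return out

def future_attention_influence(attn: np.ndarray) -> np.ndarray:
    """Per key token j, the mean attention it receives from all later queries:
    FAI(j) = mean over t > j of attn[t, j]. High values flag tokens with
    broad downstream influence (candidate anchors)."""
    T = attn.shape[0]
    out = np.zeros(T)
    for j in range(T - 1):
        out[j] = attn[j + 1 :, j].mean()
    return out
```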
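Similarly hedged, the sketch below illustrates the general idea behind targeted credit assignment: scaling per-token advantages at detected preplan and anchor tokens before a standard policy-gradient update. The thresholds (`tau_pre`, `tau_anc`) and weights (`beta_*`) are hypothetical, and the abstract does not specify the paper's three concrete strategies; this only conveys the shape of the approach.

```python
# Hedged sketch of structure-aware credit assignment: upweight advantages at
# preplan tokens (long backward reach), anchor tokens (high future influence),
# and their temporal coupling. All thresholds and weights are illustrative.
import numpy as np

def reweight_advantages(adv: np.ndarray, waad: np.ndarray, fai: np.ndarray,
                        tau_pre: float = 8.0, tau_anc: float = 0.05,
                        beta_pre: float = 0.5, beta_anc: float = 0.5,
                        beta_couple: float = 0.25) -> np.ndarray:
    """adv, waad, fai: [T] arrays for one sampled response.
    Returns advantages scaled up at critical nodes; routine tokens keep
    their original credit (scale 1.0)."""
    is_pre = waad > tau_pre   # long-range contextual reference -> preplan token
    is_anc = fai > tau_anc    # broad downstream influence -> anchor token
    scale = 1.0 + beta_pre * is_pre + beta_anc * is_anc
    # temporal coupling: an anchor immediately following a preplan token
    couple = np.zeros(adv.shape[0], dtype=bool)
    couple[1:] = is_pre[:-1] & is_anc[1:]
    scale[couple] += beta_couple
    return adv * scale
```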