How Does Reasoning Flow? Tracing Attention-Induced Information Flow for Targeted RL in LLMs
June 9, 2026
Authors: Zhichen Dong, Yang Li, Yuhan Sun, Weixun Wang, Yijia Luo, Zinian Peng, Taiheng Ye, Chao Yang, Wenbo Su, Yu Cheng, Bo Zheng, Junchi Yan
cs.AI
Abstract
Token-level credit assignment remains a key obstacle for reinforcement learning (RL) in large language models (LLMs). Existing RL recipes typically treat all tokens equally and so fail to distinguish decisive reasoning steps from routine formatting or fluent filler. Recent attempts leverage model-internal signals to assign finer-grained credit, but these are often point-wise heuristics that ignore the global structure of information propagation. We propose FlowTracer, an RL framework that traces answer-targeted reasoning flow on an attention-induced directed acyclic graph, in which nodes correspond to tokens and edge capacities come from aggregated attention weights, and that derives token credit from this global structure. The edge capacities are reweighted to retain only the influence that can reach the answer region, while local flow conservation is enforced so that intermediate tokens neither lose nor gain effective mass due to path length or irrelevant branches. On this graph, FlowTracer extracts an information-flow backbone connecting the question to the answer and scores tokens by flow throughput, revealing high-impact hubs and aggregation checkpoints that mediate long-range dependencies. The derived importances are used to shape token-level rewards, so that learning signals focus on the tokens that route information toward (or away from) correct answers. The authors report consistent performance gains across a range of reasoning tasks.
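The abstract does not give FlowTracer's exact formulation, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes a single aggregated causal attention matrix `attn`, where `attn[i][j]` is the weight token `i` places on an earlier token `j`, and it assumes the question occupies positions `[0, q_end)` and the answer occupies positions `[a_start, n)`. The names `flow_scores`, `reach`, `q_end`, and `a_start` are hypothetical. Edge capacities are reweighted by how much of their mass can still reach the answer, outgoing capacities are normalized so each intermediate token forwards exactly what it receives, and a unit of flow injected at the question is pushed forward in token order.

```python
import random


def flow_scores(attn, q_end, a_start):
    """Score tokens by answer-directed flow throughput (illustrative only).

    attn[i][j] is the aggregated attention from token i to an earlier
    token j (j < i), so information moves along the edge j -> i.
    """
    n = len(attn)

    # reach[j]: share of token j's outgoing attention mass that eventually
    # lands in the answer span. Answer tokens are treated as sinks.
    reach = [0.0] * n
    for i in range(a_start, n):
        reach[i] = 1.0
    for j in range(a_start - 1, -1, -1):
        reach[j] = sum(attn[i][j] * reach[i] for i in range(j + 1, n))

    # Inject one unit of flow, split evenly over the question tokens.
    flow = [0.0] * n
    for j in range(q_end):
        flow[j] = 1.0 / q_end

    # Push flow forward in token order. Reweighted capacities
    # attn[i][j] * reach[i] are normalized by their sum (equal to reach[j]),
    # so a non-answer token passes on everything it holds.
    for j in range(a_start):
        if reach[j] <= 0.0:
            continue  # no path to the answer: nothing can be routed onward
        for i in range(j + 1, n):
            flow[i] += flow[j] * attn[i][j] * reach[i] / reach[j]
    return flow


# Toy input (assumption): random row-normalized causal attention.
random.seed(0)
n, q_end, a_start = 8, 2, 6
attn = []
for i in range(n):
    row = [random.random() for _ in range(i + 1)] + [0.0] * (n - i - 1)
    row_sum = sum(row)
    attn.append([w / row_sum for w in row])

scores = flow_scores(attn, q_end, a_start)
print([round(s, 3) for s in scores])
print("flow absorbed by answer span:", sum(scores[a_start:]))
```

Because every non-answer token forwards all of its flow, the total absorbed by the answer span equals the injected unit up to floating-point error, which is one simple way to check that the conservation constraint holds. In this sketch, tokens with high throughput play the role of the hubs and aggregation checkpoints described above. How such scores are converted into token-level rewards is not specified in the abstract and is left out here.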