How Does Reasoning Flow? Tracing Attention-Induced Information Flow for Targeted RL in LLMs
June 9, 2026
Authors: Zhichen Dong, Yang Li, Yuhan Sun, Weixun Wang, Yijia Luo, Zinian Peng, Taiheng Ye, Chao Yang, Wenbo Su, Yu Cheng, Bo Zheng, Junchi Yan
cs.AI
Abstract
Token-level credit assignment remains a key obstacle for reinforcement learning (RL) in large language models (LLMs): standard RL recipes typically treat all tokens equally and fail to distinguish decisive reasoning steps from routine formatting or fluent filler. Recent approaches leverage model-internal signals to assign finer-grained credit, but they are often point-wise heuristics that ignore the global structure of information propagation. We propose FlowTracer, an RL framework that traces answer-targeted reasoning flow on an attention-induced directed acyclic graph (DAG), in which nodes correspond to tokens and edge capacities come from aggregated attention weights, and derives token credit from this global structure. The edge capacities are reweighted to retain only the influence that can reach the answer region, while local flow conservation is enforced so that intermediate tokens neither lose nor gain effective mass because of path length or irrelevant branches. On this graph, FlowTracer extracts an information-flow backbone connecting the question to the answer and scores tokens by flow throughput, revealing high-impact hubs and aggregation checkpoints that mediate long-range dependencies. The derived importance scores are used to shape token-level rewards, so that learning signals focus precisely on the tokens that route information toward (or away from) correct answers, delivering consistent performance gains across a range of reasoning tasks.
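To make the idea of answer-targeted flow on an attention DAG more concrete, the following is a minimal sketch and not the FlowTracer procedure itself, which the abstract does not specify. It assumes a single aggregated causal attention matrix, pushes unit mass backward from the answer tokens, and lets every intermediate token pass on exactly the mass it receives (one simple way to obtain local flow conservation). The function names `answer_flow_throughput` and `shape_token_advantages`, the mixing weight `alpha`, and the toy attention values are all hypothetical and chosen for illustration only.

```python
import numpy as np


def answer_flow_throughput(attn, question_idx, answer_idx):
    """Backward flow from answer tokens over an attention-induced DAG.

    attn[i, j] is assumed to be the aggregated attention that token i
    pays to an earlier token j. Returns one throughput score per token.
    """
    T = attn.shape[0]
    # Keep only edges from earlier to later tokens; dropping the diagonal
    # and upper triangle guarantees the graph is acyclic.
    cap = np.tril(np.asarray(attn, dtype=float), k=-1)

    # Assumption: question tokens are flow endpoints and pass nothing on.
    is_question = np.zeros(T, dtype=bool)
    is_question[list(question_idx)] = True
    cap[is_question, :] = 0.0

    # Normalise each row so a token forwards exactly what it receives.
    row_sum = cap.sum(axis=1, keepdims=True)
    share = np.divide(cap, row_sum, out=np.zeros_like(cap), where=row_sum > 0)

    # Inject unit mass, split evenly over the answer tokens.
    flow = np.zeros(T)
    flow[list(answer_idx)] = 1.0 / len(answer_idx)

    # Visit tokens last to first (reverse topological order), so each
    # token's throughput is final before it is pushed to its predecessors.
    for i in range(T - 1, -1, -1):
        flow[:i] += flow[i] * share[i, :i]
    return flow


def shape_token_advantages(advantage, flow, response_idx, alpha=0.5):
    """Hypothetical shaping rule: blend a uniform weight with scaled flow."""
    w = flow[list(response_idx)]
    peak = w.max()
    scaled = w / peak if peak > 0 else np.ones_like(w)
    return advantage * ((1.0 - alpha) + alpha * scaled)


# Toy sequence: tokens 0-1 question, 2-4 reasoning, 5 answer.
attn = np.zeros((6, 6))
attn[2, 0], attn[2, 1] = 0.5, 0.5    # step 2 reads both question tokens
attn[3, 2] = 1.0                     # token 3 is filler: nothing later reads it
attn[4, 2] = 1.0                     # step 4 builds on step 2
attn[5, 2], attn[5, 4] = 0.25, 0.75  # the answer reads steps 2 and 4

flow = answer_flow_throughput(attn, question_idx=[0, 1], answer_idx=[5])
shaped = shape_token_advantages(2.0, flow, response_idx=[2, 3, 4, 5])
print(flow)
print(shaped)
```

In this toy case token 3 lies on a branch that never reaches the answer, so it receives zero throughput, while token 2 collects the mass from both paths and acts as a hub. The mass arriving at the question tokens equals the mass injected at the answer, which is the conservation property the sketch is meant to illustrate. Because the shaped value keeps the sign of the sequence-level advantage, high-throughput tokens are reinforced more strongly for correct answers and penalised more strongly for incorrect ones.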