How Does Reasoning Flow? Attention-Guided Information Flow Tracing for Targeted Reinforcement Learning in Large Language Models

Abstract

Token-level credit assignment remains a key obstacle for reinforcement learning (RL) in large language models (LLMs): standard RL recipes typically treat all tokens equally and so fail to distinguish decisive reasoning steps from routine formatting or fluent filler. Recent approaches leverage model-internal signals to assign finer-grained credit, but these are often point-wise heuristics that ignore the global structure of information propagation. We propose FlowTracer, an RL framework that traces answer-targeted reasoning flow on an attention-induced directed acyclic graph (DAG) and derives token credit from this global structure. In the graph, nodes correspond to tokens and edge capacities come from aggregated attention weights. The edge capacities are reweighted to retain only the influence that can reach the answer region, while local flow conservation is enforced so that intermediate tokens neither lose nor gain effective mass due to path length or irrelevant branches. On this graph, FlowTracer extracts an information-flow backbone connecting the question to the answer and scores tokens by flow throughput, revealing high-impact hubs and aggregation checkpoints that mediate long-range dependencies. The derived importance scores are used to shape token-level rewards, so that learning signals focus on the tokens that route information toward (or away from) correct answers. This yields consistent performance gains across a range of reasoning tasks.
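To make the idea of conserved, answer-targeted flow concrete, the following is a minimal sketch and not FlowTracer's actual algorithm, which the abstract does not specify. It assumes the attention weights are already aggregated into one causal matrix, that flow is injected as unit demand at each answer token and pushed backward through predecessors in proportion to attention, and that question tokens absorb flow as sources. The function names `flow_throughput` and `shape_token_rewards` and the mixing parameter `alpha` are hypothetical.

```python
def flow_throughput(attn, question_idx, answer_idx):
    """Score tokens by how much answer-bound flow passes through them.

    attn[i][j] is the (assumed pre-aggregated) attention from token i to an
    earlier token j. Only entries with j < i are used, so the graph is a DAG.
    """
    n = len(attn)
    sources = set(question_idx)  # question tokens absorb flow
    flow = [0.0] * n
    for a in answer_idx:
        flow[a] = 1.0  # assumption: unit demand injected at each answer token

    # Visit tokens from last to first, so flow[i] is final when i is processed.
    for i in range(n - 1, -1, -1):
        if i in sources or flow[i] == 0.0:
            continue
        caps = attn[i][:i]  # capacities of edges j -> i, for j < i
        total = sum(caps)
        if total == 0.0:
            continue  # no predecessors: nothing to pass back
        # Local conservation: everything arriving at i is split among its
        # predecessors in proportion to their edge capacities.
        for j, cap in enumerate(caps):
            flow[j] += flow[i] * cap / total
    return flow


def shape_token_rewards(advantage, flow, response_idx, alpha=0.5):
    """Spread a sequence-level advantage over response tokens by throughput.

    Weights are normalized to mean 1, so the average token reward equals the
    original advantage and its sign is preserved.
    """
    weights = [flow[t] for t in response_idx]
    mean_w = sum(weights) / len(weights)
    if mean_w == 0.0:
        return [advantage] * len(weights)
    return [advantage * ((1 - alpha) + alpha * w / mean_w) for w in weights]


# Toy example: tokens 0-1 are the question, 2-3 are reasoning, 4 is the answer.
attn = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.2, 0.6, 0.2, 0.0, 0.0],
    [0.1, 0.1, 0.6, 0.2, 0.0],
    [0.1, 0.1, 0.2, 0.4, 0.2],
]
flow = flow_throughput(attn, question_idx=[0, 1], answer_idx=[4])
rewards = shape_token_rewards(1.0, flow, response_idx=[2, 3, 4])
print(flow)
print(rewards)
```

Because flow starts at the answer and only moves backward, attention mass that never reaches the answer contributes nothing, which is one simple way to realize the answer-targeted reweighting described above. In the toy example, the unit of flow injected at the answer token is fully recovered at the two question tokens, illustrating local conservation.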