Thought Anchors: Which LLM Reasoning Steps Matter?
June 23, 2025
Authors: Paul C. Bogdan, Uzay Macar, Neel Nanda, Arthur Conmy
cs.AI
Abstract
Reasoning large language models have recently achieved state-of-the-art performance in many fields. However, their long-form chain-of-thought reasoning creates interpretability challenges, as each generated token depends on all previous ones, making the computation harder to decompose. We argue that analyzing reasoning traces at the sentence level is a promising approach to understanding reasoning processes. We present three complementary attribution methods: (1) a black-box method measuring each sentence's counterfactual importance by comparing final answers across 100 rollouts conditioned on the model generating that sentence or one with a different meaning; (2) a white-box method that aggregates attention patterns between pairs of sentences, identifying "broadcasting" sentences that receive disproportionate attention from all future sentences via "receiver" attention heads; (3) a causal attribution method measuring logical connections between sentences by suppressing attention toward one sentence and measuring the effect on each future sentence's tokens. Each method provides evidence for the existence of thought anchors: reasoning steps that have outsized importance and that disproportionately influence the subsequent reasoning process. These thought anchors are typically planning or backtracking sentences. We provide an open-source tool (www.thought-anchors.com) for visualizing the outputs of our methods, and present a case study showing converging patterns across methods that map how a model performs multi-step reasoning. The consistency across methods demonstrates the potential of sentence-level analysis for a deeper understanding of reasoning models.
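For concreteness, here is a minimal sketch of the black-box counterfactual method (1). The `sample_rollout` stub stands in for a real LLM sampling call, and the importance score shown is the total-variation distance between the two empirical answer distributions; both are illustrative assumptions, not the paper's actual implementation.

```python
import random
from collections import Counter

def sample_rollout(prefix_sentences: list[str]) -> str:
    """Stub for an LLM call: continue the reasoning trace from
    `prefix_sentences` and return the final answer string.
    Replace with a real model API in practice."""
    return random.choice(["A", "B"])  # placeholder

def counterfactual_importance(sentences: list[str], idx: int,
                              n_rollouts: int = 100) -> float:
    """Estimate how much sentence `idx` shifts the final-answer distribution,
    comparing rollouts that keep the sentence against rollouts that branch
    just before it (so the model may generate a different sentence instead)."""
    kept = Counter(sample_rollout(sentences[: idx + 1]) for _ in range(n_rollouts))
    branched = Counter(sample_rollout(sentences[:idx]) for _ in range(n_rollouts))
    # Total-variation distance between the two empirical answer distributions.
    answers = set(kept) | set(branched)
    return 0.5 * sum(
        abs(kept[a] / n_rollouts - branched[a] / n_rollouts) for a in answers
    )
```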
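The white-box method (2) collapses token-level attention into a sentence-level matrix. A rough sketch, assuming the token-index span of each sentence is known; the paper's exact aggregation and receiver-head scoring may differ in detail.

```python
import numpy as np

def sentence_attention(attn: np.ndarray,
                       spans: list[tuple[int, int]]) -> np.ndarray:
    """Collapse one head's token-level attention matrix (seq_len x seq_len)
    into a sentence-level matrix by averaging over each pair of spans."""
    n = len(spans)
    sent_attn = np.zeros((n, n))
    for i, (q0, q1) in enumerate(spans):      # attending (later) sentence
        for j, (k0, k1) in enumerate(spans):  # attended-to (earlier) sentence
            sent_attn[i, j] = attn[q0:q1, k0:k1].mean()
    return sent_attn

def reception_scores(sent_attn: np.ndarray) -> np.ndarray:
    """Mean attention each sentence receives from strictly later sentences;
    'broadcasting' sentences show up as unusually high-scoring columns."""
    n = sent_attn.shape[0]
    scores = np.zeros(n)
    for j in range(n - 1):
        scores[j] = sent_attn[j + 1:, j].mean()
    return scores
```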
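Finally, the causal method (3) ablates attention toward one sentence and measures how each later token's distribution shifts. The toy single-head attention below only illustrates the masking idea; applying it to a real transformer would mean suppressing the span at every layer and head, and the KL-divergence effect measure is one plausible choice rather than the paper's confirmed metric.

```python
import torch
import torch.nn.functional as F

def causal_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                     suppress: tuple[int, int] | None = None) -> torch.Tensor:
    """Single-head causal attention. If `suppress` = (start, end), attention
    toward key positions in [start, end) is masked out before the softmax.
    (Query positions inside the suppressed span yield NaN rows; in practice
    one only reads off tokens after the suppressed sentence.)"""
    scores = (q @ k.transpose(-2, -1)) / k.shape[-1] ** 0.5
    t = scores.shape[-1]
    mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    if suppress is not None:
        scores[..., suppress[0]:suppress[1]] = float("-inf")
    return F.softmax(scores, dim=-1) @ v

def suppression_effect(logits_base: torch.Tensor,
                       logits_ablated: torch.Tensor) -> torch.Tensor:
    """Per-position KL(base || ablated) between next-token distributions,
    quantifying how much each future token depends on the suppressed sentence."""
    log_p = F.log_softmax(logits_base, dim=-1)
    log_q = F.log_softmax(logits_ablated, dim=-1)
    return F.kl_div(log_q, log_p, log_target=True, reduction="none").sum(-1)
```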