

Thought Anchors: Which LLM Reasoning Steps Matter?

June 23, 2025
Authors: Paul C. Bogdan, Uzay Macar, Neel Nanda, Arthur Conmy
cs.AI

Abstract

Reasoning large language models have recently achieved state-of-the-art performance in many fields. However, their long-form chain-of-thought reasoning creates interpretability challenges as each generated token depends on all previous ones, making the computation harder to decompose. We argue that analyzing reasoning traces at the sentence level is a promising approach to understanding reasoning processes. We present three complementary attribution methods: (1) a black-box method measuring each sentence's counterfactual importance by comparing final answers across 100 rollouts conditioned on the model generating that sentence or one with a different meaning; (2) a white-box method of aggregating attention patterns between pairs of sentences, which identifies "broadcasting" sentences that receive disproportionate attention from all future sentences via "receiver" attention heads; (3) a causal attribution method measuring logical connections between sentences by suppressing attention toward one sentence and measuring the effect on each future sentence's tokens. Each method provides evidence for the existence of thought anchors, reasoning steps that have outsized importance and that disproportionately influence the subsequent reasoning process. These thought anchors are typically planning or backtracking sentences. We provide an open-source tool (www.thought-anchors.com) for visualizing the outputs of our methods, and present a case study showing converging patterns across methods that map how a model performs multi-step reasoning. The consistency across methods demonstrates the potential of sentence-level analysis for a deeper understanding of reasoning models.
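To make the first (black-box) method concrete, the sketch below estimates a sentence's counterfactual importance by comparing the distribution of final answers across rollouts forced to include the sentence against rollouts resampled from the preceding prefix. This is a simplification of the abstract's description, which conditions on resampled sentences with a different meaning; `sample_continuation` and `extract_answer` are hypothetical hooks standing in for the model call and answer parsing, not part of any released API.

```python
from collections import Counter
from typing import Callable, List

def counterfactual_importance(
    sentences: List[str],
    index: int,
    sample_continuation: Callable[[str], str],
    extract_answer: Callable[[str], str],
    n_rollouts: int = 100,
) -> float:
    """Estimate how much sentence `index` shifts the final-answer distribution.

    `sample_continuation(prefix)` is assumed to return a full completion
    (remaining chain of thought plus final answer) sampled from the model;
    `extract_answer` pulls the final answer string out of a completion.
    """
    prefix_with = " ".join(sentences[: index + 1])   # trace including the sentence
    prefix_without = " ".join(sentences[:index])     # trace stopping just before it

    # Roll out the model many times from each prefix and tally final answers.
    answers_with = Counter(
        extract_answer(sample_continuation(prefix_with)) for _ in range(n_rollouts)
    )
    answers_without = Counter(
        extract_answer(sample_continuation(prefix_without)) for _ in range(n_rollouts)
    )

    # Total-variation distance between the two answer distributions:
    # large values suggest the sentence strongly steers the final answer.
    support = set(answers_with) | set(answers_without)
    return 0.5 * sum(
        abs(answers_with[a] / n_rollouts - answers_without[a] / n_rollouts)
        for a in support
    )
```

Total-variation distance is only one possible summary statistic; the abstract does not specify the exact metric computed over the 100 rollouts.

The second (white-box) method can be illustrated similarly: average a head's token-level attention map into a sentence-by-sentence matrix, then score each sentence by how much attention it receives from all later sentences. A "broadcasting" sentence would have an unusually high score under some "receiver" head. This is a minimal sketch assuming per-sentence token spans and a simple mean aggregation; the paper's actual aggregation details may differ.

```python
import numpy as np
from typing import List, Tuple

def sentence_attention_matrix(
    token_attn: np.ndarray,                       # (seq_len, seq_len) weights for one head
    sentence_spans: List[Tuple[int, int]],        # (start, end) token indices per sentence
) -> np.ndarray:
    """Average a token-level attention map into a sentence-by-sentence matrix."""
    n = len(sentence_spans)
    sent_attn = np.zeros((n, n))
    for i, (qs, qe) in enumerate(sentence_spans):      # query (later) sentence
        for j, (ks, ke) in enumerate(sentence_spans):  # key (earlier) sentence
            if j <= i:  # causal: sentences only attend to themselves and earlier ones
                sent_attn[i, j] = token_attn[qs:qe, ks:ke].mean()
    return sent_attn

def receiver_scores(sent_attn: np.ndarray) -> np.ndarray:
    """Score each sentence by the average attention it receives from all
    strictly later sentences (column means below the diagonal)."""
    n = sent_attn.shape[0]
    scores = np.zeros(n)
    for j in range(n):
        later = sent_attn[j + 1 :, j]
        scores[j] = later.mean() if later.size else 0.0
    return scores
```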