Echoes as Anchors: Probabilistic Costs and Attention Refocusing in LLM Reasoning
February 6, 2026
作者: Zhuoyuan Hao, Zhuo Li, Wu Li, Fangming Liu, Min Zhang, Jing Li
cs.AI
Abstract
Test-time compute allocation in large reasoning models (LRMs) is widely used, with applications in mathematical problem solving, code synthesis, and planning. Recent work addresses this problem by scaling self-consistency and parallel thinking, adding generic ``thinking tokens'', and prompting models to re-read the question before answering. Unfortunately, these approaches either inject task-agnostic tokens or mandate heuristics that do not explain, and often ignore, the spontaneous repetition that many LRMs exhibit at the head of their internal reasoning chains. In contrast, we analyze and harness the model's tendency to restate the question, which we term the Echo of Prompt (EOP), as a front-loaded, compute-shaping mechanism. We formalize its probabilistic cost by casting echo removal as rejection-based conditioning and defining the Echo Likelihood Gap ΔL as a computable proxy. This supplies the missing theoretical link between early repetition, likelihood gains, and downstream accuracy, but it does not by itself specify how to exploit EOP. We therefore develop Echo-Distilled SFT (ED-SFT), which instills an ``echo-then-reason'' pattern through supervised finetuning, and Echoic Prompting (EP), which re-grounds the model mid-trace without any training. While promising, quantifying benefits beyond mere verbosity is non-trivial, so we conduct length- and suffix-controlled likelihood analyses together with layer-wise attention studies, showing that EOP increases attention to the answer prefix in middle layers, consistent with an attention-refocusing mechanism. We evaluate on GSM8K, MathQA, Hendrycks-MATH, AIME24, and MATH-500 under identical decoding settings and compute budgets, and find consistent gains over baselines. Code is available at https://github.com/hhh2210/echoes-as-anchors.
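As a rough illustration of how a likelihood-gap proxy such as ΔL could be measured in practice, the sketch below scores a fixed answer under a causal LM with and without an echoed restatement of the question prepended to the reasoning context. The model name, the prompt templates, and the specific gap formula are placeholders for illustration only; the paper's formal definition via rejection-based conditioning gives the exact quantity.

```python
# Minimal sketch of an echo-likelihood-gap proxy (assumed formulation, not the
# paper's exact definition): compare the log-likelihood a model assigns to an
# answer when the reasoning context does vs. does not restate the question.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def answer_logprob(context: str, answer: str) -> float:
    """Sum of log p(answer tokens | context) under the model."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    ans_ids = tokenizer(answer, add_special_tokens=False,
                        return_tensors="pt").input_ids
    full_ids = torch.cat([ctx_ids, ans_ids], dim=1)
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probabilities at each position for predicting the *next* token.
    logprobs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    tok_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the positions that predict answer tokens.
    n_ctx = ctx_ids.shape[1]
    return tok_lp[0, n_ctx - 1:].sum().item()


def echo_likelihood_gap(question: str, answer: str) -> float:
    """ΔL proxy: change in answer log-likelihood when the reasoning context
    begins by echoing the question (prompt templates are hypothetical)."""
    plain_ctx = f"Question: {question}\nReasoning: "
    echo_ctx = f"Question: {question}\nReasoning: The question asks: {question} "
    return answer_logprob(echo_ctx, answer) - answer_logprob(plain_ctx, answer)
```

Under this proxy, a positive gap would indicate that the echoed restatement raises the likelihood the model assigns to the answer, in line with the link between early repetition and likelihood gains described above.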