Deeper Is Not Always Better: Mitigating the Alignment Tax via Confident Layer Decoding

Abstract

Autoregressive generation in large language models (LLMs) conventionally decodes from the final layer, on the assumption that deeper representations yield more reliable next-token predictions. We revisit this assumption by revealing a recurring Guess-Refine-Perturb dynamic: early layers form coarse guesses, intermediate layers refine reasoning-relevant semantics, and final layers can perturb these refined predictions toward generic or alignment-preferred tokens. We introduce Confident Decoding, a training-free decoding strategy that dynamically selects the most reliable near-final layer through an entropy-guided, conservative backward search. We further formulate layer selection as an optimal stopping problem and show that, under bounded projection noise and dominant late-stage alignment perturbation, our search rule filters out the perturbation while keeping the loss relative to the oracle refinement layer bounded. Experiments across dense and Mixture-of-Experts LLMs demonstrate consistent gains on challenging reasoning benchmarks, including GPQA-Diamond, Omni-MATH, and HLE, with zero memory overhead and a latency increase of less than 2%. These results suggest that dynamically bypassing final-layer perturbations can unlock stronger reasoning behavior from aligned LLMs.
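To make the idea of an entropy-guided, conservative backward search concrete, the following is a minimal sketch, not the paper's actual rule, which the abstract does not specify. It assumes per-layer next-token logits are already available (for example, by projecting each layer's hidden state through the model's output head). The function name `select_layer`, the parameters `window` and `margin`, and the stopping criterion are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    z = logits - np.max(logits)
    p = np.exp(z)
    return p / p.sum()


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    p = probs[probs > 0]
    return float(-(p * np.log(p)).sum())


def select_layer(layer_logits, window=3, margin=0.1):
    """Pick a decoding layer by searching backward from the final layer.

    ASSUMPTION: this rule is illustrative only. Starting at the final layer,
    step back one layer at a time (at most `window` layers) and move to the
    earlier layer only if its entropy is lower than the current best by more
    than `margin`. Stop at the first layer that does not improve, which keeps
    the search conservative and close to the final layer.

    layer_logits: sequence of 1-D arrays, ordered from earliest to final layer.
    Returns the index of the selected layer.
    """
    last = len(layer_logits) - 1
    best = last
    best_h = entropy(softmax(np.asarray(layer_logits[last], dtype=float)))
    lowest = max(0, last - window)
    for i in range(last - 1, lowest - 1, -1):
        h = entropy(softmax(np.asarray(layer_logits[i], dtype=float)))
        if h < best_h - margin:
            best, best_h = i, h
        else:
            break
    return best


# Toy example with a 4-token vocabulary: a flat early guess, a confident
# intermediate layer, and a final layer that splits mass between two tokens.
toy_logits = [
    np.array([0.0, 0.0, 0.0, 0.0]),
    np.array([5.0, 0.0, 0.0, 0.0]),
    np.array([2.0, 2.0, 0.0, 0.0]),
]
chosen = select_layer(toy_logits)
next_token = int(np.argmax(toy_logits[chosen]))
```

In this toy case the search moves back one layer, because the intermediate layer is markedly more confident than the final one, and then stops because the earliest layer is less confident. Setting `window=0` reduces the procedure to ordinary final-layer decoding.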