Deeper Is Not Always Better: Mitigating the Alignment Tax with Confident-Layer Decoding

Abstract

Autoregressive generation in large language models (LLMs) conventionally decodes from the final layer, on the assumption that deeper representations yield more reliable next-token predictions. We revisit this assumption and reveal a recurring Guess-Refine-Perturb dynamic: early layers form coarse guesses, intermediate layers refine reasoning-relevant semantics, and final layers can perturb these refined predictions toward generic or alignment-preferred tokens. We introduce Confident Decoding, a training-free decoding strategy that dynamically selects the most reliable near-final layer through an entropy-guided conservative backward search. We further formulate layer selection as an optimal stopping problem and show that, under bounded projection noise and dominant late-stage alignment perturbation, our search rule filters out the perturbation while bounding the loss relative to the oracle refinement layer. Experiments across dense and Mixture-of-Experts LLMs demonstrate consistent gains on challenging reasoning benchmarks, including GPQA-Diamond, Omni-MATH, and HLE, with zero memory overhead and a latency increase of less than 2%. These results suggest that dynamically bypassing final-layer perturbations can unlock stronger reasoning behavior from aligned LLMs.
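
The abstract does not spell out the search rule, so the following is only a minimal sketch of what an entropy-guided conservative backward search could look like, not the authors' implementation. It assumes that per-layer logits are already available (for example, by projecting each layer's hidden state through the output head) and that "conservative" means stepping back from the final layer only while the predictive entropy drops by at least a margin. The names `select_layer`, `window`, and `margin` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_layer(layer_logits, window=3, margin=0.1):
    """Pick a decoding layer by searching backward from the final layer.

    layer_logits: one logit vector per layer, ordered early to final.
    window: hypothetical cap on how many layers back the search may go.
    margin: hypothetical minimum entropy drop required to step back.
    """
    last = len(layer_logits) - 1
    best = last
    best_h = entropy(softmax(layer_logits[last]))
    # Walk backward; accept an earlier layer only if it is clearly
    # more confident, and stop at the first layer that is not.
    for layer in range(last - 1, max(last - window, 0) - 1, -1):
        h = entropy(softmax(layer_logits[layer]))
        if h < best_h - margin:
            best, best_h = layer, h
        else:
            break
    return best


# Toy example: the final layer spreads mass over two tokens,
# while the layer just before it is confident in one token.
toy_logits = [
    [0.0, 0.0, 0.0],  # early layer: coarse guess
    [2.0, 0.0, 0.0],  # intermediate layer: refining
    [5.0, 0.0, 0.0],  # near-final layer: confident
    [1.0, 1.0, 0.0],  # final layer: perturbed
]
chosen = select_layer(toy_logits)
```

In this toy input the search steps back one layer and then stops, because the next earlier layer is less confident than the one already selected. When the final layer is itself the most confident, the function returns the final layer and decoding is unchanged.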