Deeper Is Not Always Better: Mitigating the Alignment Tax through Confident Layer Decoding

Abstract

Autoregressive generation in large language models (LLMs) conventionally decodes from the final layer, on the assumption that deeper representations yield more reliable next-token predictions. We revisit this assumption by revealing a recurring Guess-Refine-Perturb dynamic: early layers form coarse guesses, intermediate layers refine reasoning-relevant semantics, and final layers can perturb these refined predictions toward generic or alignment-preferred tokens. We introduce Confident Decoding, a training-free decoding strategy that dynamically selects the most reliable near-final layer through an entropy-guided conservative backward search. We further formulate layer selection as an optimal stopping problem and show that, under bounded projection noise and dominant late-stage alignment perturbation, our search rule filters out the perturbation while bounding the loss relative to the oracle refinement layer. Experiments across dense and Mixture-of-Experts LLMs demonstrate consistent gains on challenging reasoning benchmarks, including GPQA-Diamond, Omni-MATH, and HLE, with zero memory overhead and a latency increase of less than 2%. These results suggest that dynamically bypassing final-layer perturbations can unlock stronger reasoning behavior from aligned LLMs.
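To make the idea of an entropy-guided conservative backward search concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each layer's hidden state has already been projected to vocabulary logits, and it introduces two hypothetical parameters that the abstract does not specify: `window` (how many near-final layers may be inspected) and `margin` (how much lower the entropy must be before an earlier layer is preferred). The function name `select_layer` is likewise illustrative.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; lower means a more confident prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_layer(layer_logits, window=3, margin=0.1):
    """Pick the layer index to decode from.

    layer_logits: per-layer next-token logits, ordered shallow to deep.
    window, margin: hypothetical hyperparameters (assumptions, not from the paper).
    """
    final = len(layer_logits) - 1
    best = final
    best_h = entropy(softmax(layer_logits[final]))
    # Walk backward from the final layer through at most `window` layers.
    for idx in range(final - 1, max(final - window, 0) - 1, -1):
        h = entropy(softmax(layer_logits[idx]))
        if h < best_h - margin:
            # The earlier layer is clearly more confident: move to it.
            best, best_h = idx, h
        else:
            # Conservative rule: stop at the first layer that does not help.
            break
    return best


# Toy example: the middle layer is confident, the final layer is perturbed.
toy = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
chosen = select_layer(toy, window=2, margin=0.1)
```

In this toy input the search steps back from the final layer to the confident middle layer and then stops, because the first layer is less confident still. When the final layer is already the most confident one, the loop breaks immediately and the default final-layer decoding is kept, which is the sense in which the rule is conservative.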