Post-LayerNorm Is Back: Stable, Expressive, and Deep
January 27, 2026
Authors: Chen Chen, Lai Wei
cs.AI
Abstract
Large language model (LLM) scaling is hitting a wall. Widening models yields diminishing returns, and extending context length does not improve fundamental expressivity. In contrast, depth scaling offers theoretically superior expressivity, yet current Transformer architectures struggle to train reliably at extreme depths. We revisit the Post-LayerNorm (Post-LN) formulation, whose instability at scale caused its replacement by Pre-LN in modern LLMs. We show that the central failure mode of Post-LN arises from the ResNet-style residual pathway, which introduces gradient vanishing in deep networks. We present Keel, a Post-LN Transformer that replaces this residual path with a Highway-style connection. This modification preserves gradient flow through the residual branch, preventing the signal from vanishing as it propagates from the top layers to the bottom. Unlike prior methods, Keel enables stable training at extreme depths without requiring specialized initialization or complex optimization tricks. Keel trains robustly at depths exceeding 1000 layers and consistently improves perplexity and depth-scaling characteristics over Pre-LN. These findings indicate that Post-LN, when paired with a Highway-style connection, provides a simple and effective foundation for building deeply scalable LLMs, opening up the possibility of future infinite-depth architectures.
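For readers who prefer code, the following is a minimal sketch of the idea the abstract describes: a Post-LN Transformer block whose ResNet-style residual add is replaced by a Highway-style gated blend. It is not the authors' reference implementation; the class name HighwayPostLNBlock, the per-channel sigmoid gate parameterization, and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch, NOT the paper's reference code: a Post-LN Transformer block
# in which the ResNet-style residual add is replaced by a Highway-style gated
# connection, as the abstract describes for Keel. The gate form (a learned
# per-channel sigmoid) is an assumption made for illustration.
import torch
import torch.nn as nn


class HighwayPostLNBlock(nn.Module):  # hypothetical name
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        # Post-LN placement: LayerNorm is applied AFTER the merge,
        # not before the sublayer as in Pre-LN.
        self.ln_attn = nn.LayerNorm(d_model)
        self.ln_ffn = nn.LayerNorm(d_model)
        # Highway-style transform gates (assumed form): decide, per channel,
        # how much sublayer output vs. carried input passes through.
        self.gate_attn = nn.Linear(d_model, d_model)
        self.gate_ffn = nn.Linear(d_model, d_model)

    def _highway_merge(self, x, sublayer_out, gate):
        t = torch.sigmoid(gate(x))               # transform gate in (0, 1)
        # ResNet would compute x + sublayer_out; Highway blends them instead,
        # keeping an explicit weighted path for the carried input x.
        return t * sublayer_out + (1.0 - t) * x

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.ln_attn(self._highway_merge(x, attn_out, self.gate_attn))
        ffn_out = self.ffn(x)
        x = self.ln_ffn(self._highway_merge(x, ffn_out, self.gate_ffn))
        return x


if __name__ == "__main__":
    block = HighwayPostLNBlock(d_model=64, n_heads=4, d_ff=256)
    tokens = torch.randn(2, 16, 64)              # (batch, sequence, d_model)
    print(block(tokens).shape)                   # torch.Size([2, 16, 64])
```

The design point illustrated here is the merge step: rather than the plain sum used by a ResNet-style residual, the Highway-style gate keeps an explicit weighted path for the carried input, which is the property the abstract credits with preserving gradient flow across very deep Post-LN stacks. How Keel actually parameterizes this connection is specified in the paper, not here.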