Post-LayerNorm Is Back: Stable, Expressive, and Deep
January 27, 2026
Authors: Chen Chen, Lai Wei
cs.AI
Abstract
Large language model (LLM) scaling is hitting a wall. Widening models yields diminishing returns, and extending context length does not improve fundamental expressivity. In contrast, depth scaling offers theoretically superior expressivity, yet current Transformer architectures struggle to train reliably at extreme depths. We revisit the Post-LayerNorm (Post-LN) formulation, whose instability at scale caused its replacement by Pre-LN in modern LLMs. We show that the central failure mode of Post-LN arises from the ResNet-style residual pathway, which introduces gradient vanishing in deep networks. We present Keel, a Post-LN Transformer that replaces this residual path with a Highway-style connection. This modification preserves the gradient flow through the residual branch, preventing signal vanishing from the top layers to the bottom. Unlike prior methods, Keel enables stable training at extreme depths without requiring specialized initialization or complex optimization tricks. Keel trains robustly at depths exceeding 1000 layers and consistently improves perplexity and depth-scaling characteristics over Pre-LN. These findings indicate that Post-LN, when paired with a Highway-style connection, provides a simple and effective foundation for building deeply scalable LLMs, opening the possibility for future infinite-depth architectures.
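To make the architectural idea concrete, below is a minimal PyTorch sketch of a single Post-LN sublayer whose plain residual sum is replaced by a Highway-style gated connection. The abstract does not specify Keel's exact formulation, so this is not the paper's method: the sigmoid transform gate follows the standard Highway-network recipe and is used here only to illustrate swapping "x + F(x)" for a gated mix, with LayerNorm kept in the Post-LN position (after the combination).

```python
# A minimal sketch, assuming a standard Highway-network gate; this is NOT
# the paper's exact Keel block, which the abstract does not specify.
import torch
import torch.nn as nn


class HighwayPostLNBlock(nn.Module):
    """One attention sublayer with Post-LN and a Highway-style skip.

    Standard Post-LN:      y = LayerNorm(x + F(x))
    Highway-style variant: y = LayerNorm(t * F(x) + (1 - t) * x),
    where t = sigmoid(W_t x) is a learned per-feature transform gate.
    """

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(d_model, d_model)  # transform gate T(x)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f_x, _ = self.attn(x, x, x)          # sublayer output F(x)
        t = torch.sigmoid(self.gate(x))      # Highway transform gate
        # Gated combination instead of the plain residual sum,
        # followed by LayerNorm (i.e., Post-LN placement).
        return self.norm(t * f_x + (1.0 - t) * x)


if __name__ == "__main__":
    block = HighwayPostLNBlock(d_model=64)
    x = torch.randn(2, 16, 64)               # (batch, seq_len, d_model)
    print(block(x).shape)                     # torch.Size([2, 16, 64])
```

The gate lets the network learn, per feature, how much of the transformed signal versus the untouched input to pass upward, which is the mechanism the abstract credits with keeping gradients flowing through very deep Post-LN stacks.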