Post-LayerNorm Is Back: Stable, ExpressivE, and Deep

Abstract

Large language model (LLM) scaling is hitting a wall. Widening models yields diminishing returns, and extending context length does not improve fundamental expressivity. In contrast, depth scaling offers theoretically superior expressivity, yet current Transformer architectures struggle to train reliably at extreme depths. We revisit the Post-LayerNorm (Post-LN) formulation, which modern LLMs replaced with Pre-LN because of its instability at scale. We show that the central failure mode of Post-LN arises from the ResNet-style residual pathway, which causes vanishing gradients in deep networks. We present Keel, a Post-LN Transformer that replaces this residual path with a Highway-style connection. This modification preserves the gradient flow through the residual branch, preventing the signal from vanishing between the top layers and the bottom layers. Unlike prior methods, Keel enables stable training at extreme depths without requiring specialized initialization or complex optimization tricks. Keel trains robustly at depths exceeding 1000 layers and consistently improves perplexity and depth-scaling characteristics over Pre-LN. These findings indicate that Post-LN, when paired with a Highway-style connection, provides a simple and effective foundation for building deeply scalable LLMs, opening the possibility of future infinite-depth architectures.
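As a rough illustration of the three block layouts the abstract contrasts, the sketch below places a Pre-LN block, a Post-LN block, and a Highway-style gated Post-LN block side by side. The abstract does not give Keel's exact formulation, so the gated block here follows the classic Highway form y = T(x) * F(x) + (1 - T(x)) * x and should be read as an assumption, not as Keel's definition. The names `sublayer`, `gate_bias`, and the parameter-free `layer_norm` are hypothetical stand-ins chosen for illustration.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (near) unit variance.

    Simplification: no learned scale or shift parameters.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def sublayer(x):
    """Hypothetical stand-in for an attention or feed-forward sublayer."""
    return [math.tanh(v) for v in x]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def pre_ln_block(x):
    # Pre-LN: x + F(LN(x)). The identity path is never normalized.
    return [xi + fi for xi, fi in zip(x, sublayer(layer_norm(x)))]


def post_ln_block(x):
    # Post-LN: LN(x + F(x)). The normalization sits on the residual path.
    return layer_norm([xi + fi for xi, fi in zip(x, sublayer(x))])


def highway_post_ln_block(x, gate_bias=0.0):
    # Assumed Highway-style variant: LN(T * F(x) + (1 - T) * x),
    # with an elementwise gate T = sigmoid(x + gate_bias).
    # This is a generic Highway gate, not Keel's published formula.
    f = sublayer(x)
    t = [sigmoid(xi + gate_bias) for xi in x]
    mixed = [ti * fi + (1.0 - ti) * xi for ti, fi, xi in zip(t, f, x)]
    return layer_norm(mixed)


if __name__ == "__main__":
    x = [1.0, -2.0, 0.5, 3.0]
    print("pre-LN      :", pre_ln_block(x))
    print("post-LN     :", post_ln_block(x))
    print("highway+LN  :", highway_post_ln_block(x))
```

In this sketch, a strongly negative `gate_bias` closes the gate, so the block reduces to LN(x) and passes the input through almost unchanged; a learned, input-dependent gate would sit between that extreme and the fully transformed output.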
