テーパー言語モデル

要旨

近年の言語モデル（Transformer、再帰型、記憶ベースの派生型を含む）は、深さ方向に均一にパラメータが割り当てられた同一構成の層を積み重ねるという共通の基本構造を採用している。これはオリジナルのTransformerから継承され、その後ほとんど変更されていないデフォルトの設計である。しかし、層ごとに最終出力への寄与が不均一であり、後段の層は残差ストリームを変換するのではなく洗練（リファイン）する傾向があることを示す証拠が蓄積されつつある。本研究では、この非対称性をパラメータ容量の配分に反映すべきかを問う。制御実験の結果、一定の予算制約下で、前段の層により多くの容量を、後段の層により少ない容量を割り当てると、均一幅のベースラインと比較してパープレキシティが改善される一方、逆の割り当ては性能を損なうことが示された。この結果を基に、一定の総予算のもとでパラメータを担う構成要素を深さ方向に単調にテーパリング（先細り）するアーキテクチャ原理である「テーパード言語モデル（TLM）」を提案する。MLP層はこの適用に最も適している。なぜなら、MLPは現代のあらゆる言語モデルファミリーにおいてパラメータ数の大半を占め、その幅という単一かつ明確な軸で変化を加えられるからである。3つのモデル規模と4つのアーキテクチャ（Transformer、Gated Attention、Hope-attention、Titans）において、滑らかなコサインスケジュールでMLP幅をテーパリングすることで、パラメータ数や計算コストを増やすことなく、均一幅のベースラインと比較してパープレキシティと下流ベンチマーク性能が一貫して向上した。これらの知見は、深さを考慮した容量配分が、アーキテクチャに依存しないシンプルな言語モデル設計の軸であり、目に見えているのに見過ごされていた自由なレバーであることを示している。

English

Modern language models, including transformer, recurrent, and memory-based variants, share a common chassis: a stack of identical layers in which parameters are allocated uniformly across depth. This is a default inherited from the original transformer and largely unchanged since, yet a growing body of evidence suggests that layers contribute non-uniformly to the final output, with later layers refining the residual stream rather than transforming it. We ask whether parameter capacity should reflect this asymmetry. Our controlled experiment shows that, under a fixed budget, allocating more capacity to earlier layers and less to later layers improves perplexity over a uniform-width baseline, while the reverse allocation hurts. Building on this result, we introduce Tapered Language Models (TLMs), an architectural principle in which a parameter-bearing component is monotonically tapered across depth under a fixed total budget. MLPs are the natural site for this instantiation: they dominate parameter count across all modern LM families and expose width as a single, clean axis of variation. Across three model scales and four architectures (Transformer, Gated Attention, Hope-attention, and Titans), tapering MLP width via a smooth cosine schedule consistently improves perplexity and downstream benchmark performance over uniform baselines, at no additional parameter or compute cost. These findings establish depth-aware capacity allocation as a simple, architecture-agnostic axis of language model design, a free lever hidden in plain sight.