Revisiting the Shape Convention of Transformer Language Models

Abstract

Dense Transformer language models have largely adhered to one consistent architectural shape: each layer consists of an attention module followed by a feed-forward network (FFN) with a narrow-wide-narrow MLP, allocating most parameters to the MLP at expansion ratios between 2 and 4. Motivated by recent results showing that residual wide-narrow-wide (hourglass) MLPs offer superior function approximation capabilities, we revisit the long-standing MLP shape convention in Transformers, challenging the necessity of the narrow-wide-narrow design. To study this, we develop a Transformer variant that replaces the conventional FFN with a deeper hourglass-shaped FFN, comprising a stack of hourglass sub-MLPs connected by residual pathways. We posit that a deeper but lighter hourglass FFN can serve as a competitive alternative to the conventional FFN, and that the parameters saved by using a lighter hourglass FFN can be put to more effective use, for example by enlarging the model's hidden dimension under a fixed budget. We confirm both hypotheses through empirical validation across model scales: hourglass FFNs outperform conventional FFNs at scales up to 400M parameters and achieve comparable performance at larger scales up to 1B parameters; hourglass FFN variants with reduced FFN parameters and increased attention parameters show consistent improvements over conventional configurations at matched budgets. Together, these findings shed new light on recent work and prompt a rethinking of the narrow-wide-narrow MLP convention and of the balance between attention and FFN in efficient and expressive modern language models.
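To make the shape contrast concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It compares weight counts for a conventional narrow-wide-narrow FFN and a stack of hourglass sub-MLPs, and shows one plausible reading of "connected by residual pathways" (a skip connection around each sub-MLP). The function names, the ReLU activation, the omission of biases and normalization, and the dimensions used (`d_model=512`, `d_bottleneck=128`, three sub-MLPs) are all illustrative assumptions that do not come from the paper.

```python
import random

def conventional_ffn_params(d_model, expansion):
    """Weight count of a narrow-wide-narrow MLP:
    d_model -> expansion * d_model -> d_model (biases ignored)."""
    d_hidden = expansion * d_model
    return 2 * d_model * d_hidden

def hourglass_ffn_params(d_model, d_bottleneck, num_sub_mlps):
    """Weight count of a stack of wide-narrow-wide sub-MLPs,
    each d_model -> d_bottleneck -> d_model (biases ignored)."""
    return num_sub_mlps * 2 * d_model * d_bottleneck

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(w, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]

def hourglass_ffn_forward(x, sub_mlps):
    """Apply each sub-MLP with a residual connection (an assumed wiring):
    x <- x + W_up @ relu(W_down @ x)."""
    for w_down, w_up in sub_mlps:
        h = relu(matvec(w_down, x))      # project down to the bottleneck
        update = matvec(w_up, h)         # project back up to d_model
        x = [xi + ui for xi, ui in zip(x, update)]
    return x

def random_matrix(rows, cols, rng):
    """Small random weights, for demonstration only."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

# Hypothetical sizes chosen only to illustrate the budget gap.
conv = conventional_ffn_params(d_model=512, expansion=4)
hour = hourglass_ffn_params(d_model=512, d_bottleneck=128, num_sub_mlps=3)
print(f"conventional FFN weights: {conv}")
print(f"hourglass FFN weights:    {hour}")

# Tiny forward pass: d_model=4, d_bottleneck=2, two sub-MLPs.
rng = random.Random(0)
toy_stack = [(random_matrix(2, 4, rng), random_matrix(4, 2, rng)) for _ in range(2)]
print(hourglass_ffn_forward([1.0, 2.0, 3.0, 4.0], toy_stack))
```

Under these assumed sizes the hourglass stack is deeper (three sub-MLPs instead of one MLP) yet uses fewer weights, which is the kind of saving the abstract proposes to reinvest, for instance in a larger hidden dimension or in attention.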
