Learning to Skip the Middle Layers of Transformers

Abstract

Conditional computation is a popular strategy to make Transformers more efficient. Existing methods often target individual modules (e.g., mixture-of-experts layers) or skip layers independently of one another. However, interpretability research has demonstrated that the middle layers of Transformers exhibit greater redundancy, and that early layers aggregate information into token positions. Guided by these insights, we propose a novel architecture that dynamically skips a variable number of layers from the middle outward. In particular, a learned gating mechanism determines whether to bypass a symmetric span of central blocks based on the input, and a gated attention mechanism prevents subsequent tokens from attending to skipped token positions. Residual norms are controlled with a 'sandwich' or 'perilayernorm' scheme, and gate sparsity with an adaptive regularization loss. We aimed to reduce compute requirements for 'simpler' tokens and potentially foster an emergent multi-level representational hierarchy. However, at the scales investigated, our approach does not improve the trade-off between validation cross-entropy and estimated FLOPs compared to dense baselines with fewer layers. We release our code at https://github.com/tim-lawson/skip-middle.
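The middle-outward skipping and the gated attention described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that). The function names `executed_blocks` and `causal_mask_with_skips`, the parameter `exit_block`, and the use of hard boolean decisions in place of learned gate values are all assumptions made for illustration.

```python
from typing import List, Optional


def executed_blocks(num_blocks: int, exit_block: Optional[int]) -> List[int]:
    """Return the block indices one token runs (hypothetical helper).

    Assumption: if the token's gate fires after block `exit_block`, the token
    runs blocks 0..exit_block, bypasses the symmetric middle span, and resumes
    at the mirror block num_blocks - 1 - exit_block.
    `exit_block=None` means the gate never fired (the dense path).
    """
    if exit_block is None:
        return list(range(num_blocks))
    if not 0 <= exit_block < num_blocks // 2:
        raise ValueError("exit_block must lie in the first half of the stack")
    resume = num_blocks - 1 - exit_block
    return list(range(exit_block + 1)) + list(range(resume, num_blocks))


def causal_mask_with_skips(executed: List[bool]) -> List[List[bool]]:
    """Build an attention mask for a single block (hypothetical helper).

    executed[k] is False for tokens that bypass this block.
    mask[q][k] is True only if key k is not after query q (causal)
    and token k actually ran this block.
    """
    n = len(executed)
    return [[k <= q and executed[k] for k in range(n)] for q in range(n)]


if __name__ == "__main__":
    # A token that exits after block 1 of an 8-block stack skips blocks 2..5.
    print(executed_blocks(8, 1))
    # The token at position 1 bypassed this block, so later tokens ignore it.
    for row in causal_mask_with_skips([True, False, True]):
        print(row)
```

In this simplification, a token that exits earlier skips a wider span, so tokens judged 'simpler' by the gate would use fewer blocks, while the mask keeps later tokens from reading positions that have no representation in a skipped block.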
