Your Transformer is Secretly Linear

Abstract

This paper reveals a novel linear characteristic exclusive to transformer decoders, including models such as GPT, LLaMA, OPT, BLOOM, and others. We analyze embedding transformations between sequential layers and uncover a near-perfect linear relationship (Procrustes similarity score of 0.99). However, linearity decreases when the residual component is removed, because the output norm of the transformer layer is consistently low. Our experiments show that removing or linearly approximating some of the most linear blocks of transformers does not significantly affect the loss or model performance. Moreover, in our pretraining experiments on smaller models, we introduce a cosine-similarity-based regularization aimed at reducing layer linearity. This regularization improves performance metrics on benchmarks such as Tiny Stories and SuperGLUE, and also successfully decreases the linearity of the models. This study challenges the existing understanding of transformer architectures, suggesting that their operation may be more linear than previously assumed.
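As a rough illustration of what a layer-to-layer linearity score can look like, the sketch below fits the best linear map between two sets of embeddings and reports one minus the normalized residual. This is only an assumption about the general shape of such a metric: the abstract does not give the exact Procrustes similarity definition, and the function name `linearity_score`, the centering and Frobenius normalization, and the random toy data are all hypothetical choices made for this example, not the authors' implementation.

```python
import numpy as np


def linearity_score(x: np.ndarray, y: np.ndarray) -> float:
    """Return 1 minus the normalized least-squares residual of the best linear map x -> y.

    Hypothetical helper for illustration; x and y have shape (n_tokens, dim).
    """
    # Center both sets of embeddings and scale them to unit Frobenius norm.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    x = x / np.linalg.norm(x)
    y = y / np.linalg.norm(y)
    # Solve min_A ||x A - y||_F with ordinary least squares.
    a, *_ = np.linalg.lstsq(x, y, rcond=None)
    residual = np.linalg.norm(x @ a - y) ** 2
    return float(1.0 - residual)


# Toy data standing in for the embeddings entering a layer (an assumption).
rng = np.random.default_rng(0)
x = rng.normal(size=(512, 16))
w = rng.normal(size=(16, 16))

# A purely linear "layer", a strongly nonlinear one, and a residual-style
# update whose nonlinear branch has a small norm relative to its input.
score_linear = linearity_score(x, x @ w)
score_nonlinear = linearity_score(x, np.tanh(3.0 * (x @ w)))
score_residual = linearity_score(x, x + 0.05 * np.tanh(x @ w))

print(score_linear, score_nonlinear, score_residual)
```

In this toy setting the residual-style update scores close to 1 even though its branch is nonlinear, simply because the branch output is small compared with the input it is added to. That is consistent with the abstract's remark that linearity drops once the residual component is removed.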
