Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time

Abstract

The quadratic computational complexity of the self-attention mechanism in popular transformer architectures poses significant challenges for training and inference, particularly in terms of efficiency and memory requirements. To address these challenges, this paper introduces a novel fast computation method for gradient calculation in multi-layer transformer models. Our approach enables the gradients of the entire multi-layer transformer model to be computed in almost linear time n^{1+o(1)}, where n is the input sequence length. This breakthrough significantly reduces the computational bottleneck associated with the traditional quadratic time complexity. Our theory holds for any loss function and maintains a bounded approximation error across the entire model. Furthermore, our analysis holds when the multi-layer transformer model contains many practical sub-modules, such as residual connections, causal masks, and multi-head attention. By improving the efficiency of gradient computation in large language models, we hope that our work, based on our theoretical results, will facilitate more effective training and deployment of long-context language models.
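
For context, the quadratic cost mentioned in the abstract comes from the query-key scores that standard attention evaluates for every pair of positions. The following is a minimal sketch of that baseline, assuming single-head causal softmax attention; it is not the paper's approximation method, and the names `causal_attention` and `count_scores` are hypothetical, chosen only for illustration.

```python
import math


def causal_attention(Q, K, V):
    """Exact single-head causal softmax attention on lists of lists.

    Illustrative baseline only (an assumption, not the paper's method).
    Returns (outputs, num_scores), where num_scores counts the query-key
    dot products evaluated; it equals n(n+1)/2, i.e. quadratic in n.
    """
    n, d = len(Q), len(Q[0])
    outputs, num_scores = [], 0
    for i in range(n):
        # Causal mask: position i attends only to positions 0..i.
        scores = [
            sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        num_scores += len(scores)
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * V[j][c] for j, w in enumerate(weights)) / total
            for c in range(len(V[0]))
        ])
    return outputs, num_scores


def count_scores(n, d=4):
    """Number of query-key scores computed for a toy length-n input."""
    X = [[(i + 1) * 0.01 * (c + 1) for c in range(d)] for i in range(n)]
    return causal_attention(X, X, X)[1]
```

In this sketch, doubling the sequence length from 8 to 16 raises the number of scores from 36 to 136, roughly a fourfold increase. This is the growth that an almost linear time method would avoid.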
