Breaking the Attention Bottleneck

Abstract

Attention-based transformers have become the standard architecture in many deep learning fields, primarily because they can model long-range dependencies and handle variable-length input sequences. However, the attention mechanism, with its quadratic complexity, is a significant bottleneck in the transformer architecture. In the decoder it is only unidirectional, and in over-parameterized decoder-only models it converges to a static pattern. I address this issue by developing a generative function as a replacement for attention or activation. It retains the auto-regressive character by comparing each token with the previous one. In my test setting with nanoGPT, this yields a smaller loss with a smaller model. The loss drops further when an average context vector is incorporated. This concept of attention replacement is distributed under the GNU AGPL v3 license at https://gitlab.com/Bachstelze/causal_generation.
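To make the two ideas in the abstract concrete (comparing each token with the previous one, and adding an average context vector), here is a minimal NumPy sketch. It is not the implementation from the linked repository: the function name `previous_token_mix`, the use of an elementwise product as the "comparison", and the choice of a causal running mean as the "average context vector" are all assumptions made for illustration. The point is only that such a layer can stay causal while costing time linear in the sequence length.

```python
import numpy as np

def previous_token_mix(x, use_context=True):
    """Hypothetical attention replacement (illustrative sketch only).

    x: array of shape (T, D), one embedding per token position.
    Returns an array of shape (T, D).
    """
    x = np.asarray(x, dtype=float)
    T, _ = x.shape

    # Shift the sequence by one step so position t sees token t-1.
    # Position 0 has no predecessor and is paired with zeros.
    prev = np.zeros_like(x)
    prev[1:] = x[:-1]

    # Assumption: "comparing" is modelled as an elementwise product.
    out = x * prev

    if use_context:
        # Assumption: the average context vector is the running mean
        # over positions 0..t, which never looks at future tokens.
        counts = np.arange(1, T + 1, dtype=float)[:, None]
        context = np.cumsum(x, axis=0) / counts
        out = out + context

    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
y = previous_token_mix(x)

# Causality check: perturbing positions 4 and 5 must leave
# the outputs at positions 0..3 unchanged.
x_perturbed = x.copy()
x_perturbed[4:] += 1.0
y_perturbed = previous_token_mix(x_perturbed)
```

Every operation here is a shift, an elementwise product, or a cumulative sum, so the cost grows linearly with the number of tokens rather than quadratically as in standard attention. How the actual generative function is defined, and where it is placed inside nanoGPT, should be taken from the repository itself.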
