A Systematic Analysis of Hybrid Linear Attention

Abstract

Transformers face quadratic compute and memory costs on long sequences, which has prompted the adoption of linear attention mechanisms that use fixed-size hidden states. However, linear models often suffer from limited recall performance, leading to hybrid architectures that combine linear and full attention layers. Although hybrid architectures have been studied extensively, the choice of linear attention component has not been explored in depth. We systematically evaluate linear attention models across generations, from vector recurrences to advanced gating mechanisms, both standalone and hybridized. To enable this comprehensive analysis, we trained and open-sourced 72 models: 36 at 340M parameters (20B tokens) and 36 at 1.3B parameters (100B tokens), covering six linear attention variants across five hybridization ratios. Benchmarking on standard language modeling and recall tasks reveals that superior standalone linear models do not necessarily excel in hybrids. While language modeling remains stable across linear-to-full attention ratios, recall improves significantly as the number of full attention layers increases, particularly below a 3:1 ratio. Our study highlights selective gating, hierarchical recurrence, and controlled forgetting as critical for effective hybrid models. We recommend architectures such as HGRN-2 or GatedDeltaNet with a linear-to-full ratio between 3:1 and 6:1 to achieve Transformer-level recall efficiently. Our models are open-sourced at https://huggingface.co/collections/m-a-p/hybrid-linear-attention-research-686c488a63d609d2f20e2b1e.
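To make the notion of a linear-to-full ratio concrete, the following is a minimal sketch of how a layer stack might be laid out for a given ratio. The function names, the 24-layer depth, and the placement rule (a run of linear layers followed by one full attention layer) are all assumptions for illustration; the abstract states only the ratios, not where the full attention layers sit in the stack.

```python
from typing import List


def hybrid_layer_schedule(num_layers: int, linear_per_full: int) -> List[str]:
    """Return a per-layer list of "linear" / "full" labels.

    Assumption (not from the abstract): layers are grouped into blocks of
    `linear_per_full` linear attention layers followed by one full attention
    layer, so every (linear_per_full + 1)-th layer is full attention.
    """
    if num_layers <= 0 or linear_per_full <= 0:
        raise ValueError("num_layers and linear_per_full must be positive")
    block = linear_per_full + 1
    return [
        "full" if (i + 1) % block == 0 else "linear"
        for i in range(num_layers)
    ]


def realized_ratio(schedule: List[str]) -> float:
    """Linear-to-full ratio actually realized by a schedule."""
    n_full = schedule.count("full")
    n_linear = schedule.count("linear")
    return n_linear / n_full if n_full else float("inf")


if __name__ == "__main__":
    # 24 layers is a hypothetical depth chosen only for this example.
    for ratio in (3, 6):
        schedule = hybrid_layer_schedule(24, ratio)
        print(ratio, schedule.count("linear"), schedule.count("full"),
              realized_ratio(schedule))
```

Under this placement rule, the nominal ratio is realized exactly only when the depth is a multiple of the block size: a 24-layer stack at 3:1 gives 18 linear and 6 full attention layers, whereas at 6:1 it gives 21 linear and 3 full attention layers, a realized ratio of 7:1.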
