Parallax: Parameterized Local Linear Attention for Language Modeling

Abstract

Large Language Models (LLMs) have become the central paradigm in artificial intelligence, yet attention, their core computational primitive, has remained structurally unchanged. Local Linear Attention (LLA) is an attention mechanism derived from nonparametric statistics within the test-time regression framework. In contrast to prior work on efficient attention variants, LLA upgrades the local constant estimate in softmax attention to a local linear estimate, yielding a provably superior bias-variance tradeoff for associative memory. However, LLA has not been scaled up in LLM pretraining because of its computational cost and numerical stability concerns. We introduce Parallax, a parameterized Local Linear Attention that scales to LLMs. Parallax eliminates the numerical solver in LLA and learns an extra query-like projector that probes the KV covariance. We place Parallax within a family of attention mechanisms connected by the bandwidth, the probe construction, and the affine structure. We propose a hardware-aware algorithm that increases arithmetic intensity relative to FlashAttention, shifting attention into a more compute-bound regime. Our prototype decode kernel matches or outperforms FlashAttention 2/3 across diverse batch sizes and context lengths. We pretrain Parallax at the 0.6B and 1.7B parameter scales and find consistent perplexity improvements throughout pretraining, with gains that transfer to downstream benchmarks. The advantage persists under both parameter-matched and compute-matched controls, demonstrating a Pareto improvement. We perform careful pretraining ablations and identify a novel phenomenon whereby Muon unlocks the capacity of Parallax. To our knowledge, this is the first empirical demonstration of strong architecture-optimizer co-design for attention mechanisms in the architecture research literature.
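
The distinction between a local constant and a local linear estimate can be illustrated with a one-dimensional toy problem. The sketch below is not the LLA or Parallax algorithm (Parallax, as stated above, eliminates the solver and learns a probe projector instead); it is only the textbook nonparametric contrast, under the assumptions that keys and values are scalars, that the kernel weight is exp(q * k / h), and that h plays the role of a bandwidth. The function names `kernel_weights`, `local_constant`, and `local_linear` are hypothetical and chosen for illustration.

```python
import math


def kernel_weights(q, keys, h=1.0):
    """Unnormalized softmax-style weights exp(q * k / h) for scalar keys (assumed kernel)."""
    scores = [q * k / h for k in keys]
    m = max(scores)  # subtract the max score for numerical stability
    return [math.exp(s - m) for s in scores]


def local_constant(q, keys, values, h=1.0):
    """Kernel-weighted average of the values, i.e. a softmax-attention-style readout."""
    w = kernel_weights(q, keys, h)
    return sum(wi * vi for wi, vi in zip(w, values)) / sum(w)


def local_linear(q, keys, values, h=1.0):
    """Intercept of a kernel-weighted least-squares line centered at the query."""
    w = kernel_weights(q, keys, h)
    d = [k - q for k in keys]  # key offsets relative to the query
    s0 = sum(w)
    s1 = sum(wi * di for wi, di in zip(w, d))
    s2 = sum(wi * di * di for wi, di in zip(w, d))
    t0 = sum(wi * vi for wi, vi in zip(w, values))
    t1 = sum(wi * di * vi for wi, di, vi in zip(w, d, values))
    det = s0 * s2 - s1 * s1  # determinant of the 2x2 normal equations
    if det == 0.0:
        raise ValueError("degenerate fit: need at least two distinct keys")
    return (s2 * t0 - s1 * t1) / det


if __name__ == "__main__":
    keys = [0.0, 0.5, 1.0, 1.5, 2.0]
    values = [2.0 * k + 1.0 for k in keys]  # values are an exact linear map of the keys
    q = 1.8
    print("target        :", 2.0 * q + 1.0)
    print("local constant:", local_constant(q, keys, values))
    print("local linear  :", local_linear(q, keys, values))
```

When the values are an exact linear function of the keys, the local linear estimate recovers the target at the query up to floating-point error, whereas the weighted average is pulled toward the values of the most heavily weighted keys. This is the kind of bias that a local linear estimate is meant to remove, here shown only in a scalar setting.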