Higher-order Linear Attention

Abstract

The quadratic cost of scaled dot-product attention is a central obstacle to scaling autoregressive language models to long contexts. Linear-time attention and State Space Models (SSMs) provide scalable alternatives, but they are typically restricted to first-order or kernel-based approximations, which can limit expressivity. We introduce Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher-order interactions via compact prefix sufficient statistics. In the second-order case, HLA maintains a constant-size state and computes per-token outputs in linear time without materializing any n × n matrices. We give closed-form streaming identities, a strictly causal masked variant that uses two additional summaries, and a chunk-parallel training scheme based on associative scans that exactly reproduces the activations of a serial recurrence. We further outline extensions to third and higher orders. Collectively, these results position HLA as a principled, scalable building block that combines attention-like, data-dependent mixing with the efficiency of modern recurrent architectures. Project page: https://github.com/yifanzhang-pro/HLA
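
To make the idea of a constant-size streaming state concrete, the following is a minimal sketch and not the reference implementation. The abstract does not spell out the sufficient statistics, so the sketch assumes, purely for illustration, an unnormalized second-order form in which the state consists of two running sums, S = sum of k_i k_i^T and C = sum of q_i v_i^T, and the output at step t is q_t^T S C. The function names hla2_streaming and hla2_reference are hypothetical. The strictly causal masked variant, its two additional summaries, and any normalization are not covered here.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def outer(a, b):
    return [[x * y for y in b] for x in a]

def mat_add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def vec_mat(v, M):
    # Row vector v (length r) times matrix M (r x c) -> row vector (length c).
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def hla2_streaming(Q, K, V):
    """Assumed second-order form: o_t = q_t^T S_t C_t with running sums S_t, C_t.

    The state size depends only on the feature dimensions, not on the
    sequence length, so the total cost is linear in the number of tokens.
    """
    d, dv = len(Q[0]), len(V[0])
    S = [[0.0] * d for _ in range(d)]    # running sum of k_i k_i^T (d x d)
    C = [[0.0] * dv for _ in range(d)]   # running sum of q_i v_i^T (d x dv)
    outputs = []
    for q, k, v in zip(Q, K, V):
        S = mat_add(S, outer(k, k))
        C = mat_add(C, outer(q, v))
        outputs.append(vec_mat(vec_mat(q, S), C))
    return outputs

def hla2_reference(Q, K, V):
    """Same quantity computed by revisiting the whole prefix at every step.

    This costs O(t) work per token (quadratic overall) and serves only
    as a check on the streaming version.
    """
    outputs = []
    for t in range(1, len(Q) + 1):
        Qp, Kp, Vp = Q[:t], K[:t], V[:t]
        q = Q[t - 1]
        a = [dot(q, k) for k in Kp]      # q_t . k_i for each prefix position i
        b = vec_mat(a, Kp)               # sum_i a_i k_i (length d)
        c = [dot(b, qj) for qj in Qp]    # b . q_j for each prefix position j
        outputs.append(vec_mat(c, Vp))   # sum_j c_j v_j (length dv)
    return outputs

if __name__ == "__main__":
    random.seed(0)
    n, d, dv = 6, 3, 2
    Q = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
    K = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
    V = [[random.uniform(-1, 1) for _ in range(dv)] for _ in range(n)]
    fast, slow = hla2_streaming(Q, K, V), hla2_reference(Q, K, V)
    max_diff = max(abs(x - y) for rf, rs in zip(fast, slow) for x, y in zip(rf, rs))
    print("max abs difference:", max_diff)
```

Because both running sums are updated by plain addition, which is associative, per-chunk partial sums can be combined in any grouping. This is the property a scan-based chunk-parallel scheme relies on, although the scheme for the masked variant described in the abstract involves more than this sketch shows.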