Sessa: Selective State Space Attention

Abstract

Modern sequence modeling is dominated by two families: Transformers, whose self-attention can access arbitrary elements of the visible sequence, and structured state-space models, which propagate information through an explicit recurrent state. These mechanisms face different limitations on long contexts: when attention is diffuse, the influence of individual tokens is diluted across the effective support, while recurrent state propagation can lose long-range sensitivity unless information is actively preserved. As a result, both mechanisms face challenges in preserving and selectively retrieving information over long contexts. We propose Sessa, a decoder that places attention inside a recurrent feedback path. This creates many attention-based paths through which past tokens can influence future states, rather than relying on a single attention read or a single recurrent chain. We prove that, under explicit assumptions and matched regimes, Sessa admits power-law memory tails O(ℓ^{-β}) for 0 < β < 1, with slower decay than in the corresponding Transformer and Mamba-style baselines. We further give an explicit construction that achieves this power-law rate. Under the same assumptions, Sessa is the only model class among those considered that realizes flexible selective retrieval, including profiles whose influence does not decay with distance. Consistent with this theoretical advantage, across matched experiments, Sessa achieves the strongest performance on long-context benchmarks while remaining competitive with Transformer and Mamba-style baselines on short-context language modeling.
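To give a feel for what a power-law memory tail means numerically, the following minimal sketch compares an ℓ^{-β} tail against two illustrative baselines. The baseline shapes are assumptions made here for illustration only and are not taken from the paper's analysis: a 1/ℓ profile stands in for attention spread uniformly over ℓ visible tokens, and a geometric profile decay^ℓ stands in for a stable recurrence that is not actively preserving information. The function names and the values β = 0.5 and decay = 0.99 are likewise hypothetical.

```python
def power_law_tail(lag: int, beta: float) -> float:
    """Influence of a token `lag` steps back under an l^(-beta) tail, 0 < beta < 1."""
    return lag ** (-beta)


def uniform_attention_tail(lag: int) -> float:
    """Assumed baseline: attention spread evenly over `lag` tokens gives weight 1/lag."""
    return 1.0 / lag


def exponential_tail(lag: int, decay: float) -> float:
    """Assumed baseline: a stable recurrence shrinking influence by `decay` per step."""
    return decay ** lag


if __name__ == "__main__":
    beta, decay = 0.5, 0.99  # illustrative values, not from the paper
    print(f"{'lag':>8} {'power law':>12} {'1/lag':>12} {'exponential':>12}")
    for lag in (10, 100, 1_000, 10_000):
        print(
            f"{lag:>8} "
            f"{power_law_tail(lag, beta):>12.3e} "
            f"{uniform_attention_tail(lag):>12.3e} "
            f"{exponential_tail(lag, decay):>12.3e}"
        )
```

Under these assumed profiles, the geometric baseline can exceed the power-law tail at short lags, but it falls far below it once the lag is large, while the 1/ℓ baseline stays below ℓ^{-β} at every lag greater than 1 because β < 1.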
