On Subquadratic Architectures: From Applications to Principles

Abstract

Transformers dominate modern sequence modeling, but their quadratic attention incurs substantial computational cost. Subquadratic architectures offer a scalable alternative. However, it remains unclear which designs yield the most effective sequence models. We compare three leading approaches: xLSTM, Mamba-2, and Gated DeltaNet. We evaluate these models on tasks with complex dependencies: (1) code-model pre-training, (2) distillation of code models from large language models, and (3) pre-training of time-series foundation models. Across these settings, xLSTM delivers the strongest overall performance. To explain xLSTM's advantage, we present a unified formulation and analyze the underlying architectural mechanisms, focusing on state tracking and memory dynamics. Our results show that xLSTM enables more flexible and stable memory correction via its gating scheme. We corroborate these findings on controlled synthetic length-generalization tasks. Overall, our findings indicate that xLSTM's gains on complex tasks stem from robust state tracking and accumulation.
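
The abstract does not spell out the unified formulation, so the following is only a minimal sketch of the kind of memory update being compared. It assumes the recurrences commonly given in the literature for these model families: a decayed additive write (Mamba-2-style scalar decay, or mLSTM-style forget and input gates with a normalizer state) and a gated delta-rule write (Gated DeltaNet-style). All function names are hypothetical, gate values are passed in as plain numbers rather than computed from inputs, and the mLSTM output gate and exponential-gate stabilization are omitted.

```python
# Illustrative only: one-step updates for a (d_v x d_k) matrix memory S,
# stored as a list of rows. The update rules are the forms commonly given
# in the literature (an assumption; they are not taken from this abstract).

def matvec(S, q):
    """Read from memory: returns S q."""
    return [sum(s * x for s, x in zip(row, q)) for row in S]

def decay_write_step(S, k, v, decay, write):
    """S <- decay * S + write * v k^T (additive write with scalar decay)."""
    return [[decay * s + write * vi * kj for s, kj in zip(row, k)]
            for row, vi in zip(S, v)]

def gated_delta_step(S, k, v, decay, beta):
    """S <- decay * S (I - beta k k^T) + beta * v k^T (gated delta rule)."""
    pred = matvec(S, k)  # what the memory currently returns for key k
    return [[decay * (s - beta * p * kj) + beta * vi * kj
             for s, kj in zip(row, k)]
            for row, p, vi in zip(S, pred, v)]

def mlstm_step(C, n, k, v, f, i):
    """mLSTM-style step: gated additive write plus a normalizer state n."""
    C = decay_write_step(C, k, v, decay=f, write=i)
    n = [f * nj + i * kj for nj, kj in zip(n, k)]
    return C, n

def mlstm_read(C, n, q):
    """Normalized read: C q / max(|n . q|, 1)."""
    denom = max(abs(sum(nj * qj for nj, qj in zip(n, q))), 1.0)
    return [x / denom for x in matvec(C, q)]

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

# Write two different values under the same unit-norm key, then read it back.
k = [1.0, 0.0]
v_old, v_new = [1.0, 2.0], [3.0, 4.0]

# Additive write without decay: both values accumulate under the key.
S = zeros(2, 2)
for v in (v_old, v_new):
    S = decay_write_step(S, k, v, decay=1.0, write=1.0)
additive_read = matvec(S, k)

# Delta rule with beta = 1: the old value is erased before the new one is written.
S = zeros(2, 2)
for v in (v_old, v_new):
    S = gated_delta_step(S, k, v, decay=1.0, beta=1.0)
delta_read = matvec(S, k)

# mLSTM-style write with f = i = 1: values accumulate, and the read is
# rescaled by the normalizer state.
C, n = zeros(2, 2), [0.0, 0.0]
for v in (v_old, v_new):
    C, n = mlstm_step(C, n, k, v, f=1.0, i=1.0)
mlstm_read_out = mlstm_read(C, n, k)

print(additive_read, delta_read, mlstm_read_out)
```

In this toy setting the additive write returns the sum of both values, the delta-rule write returns only the most recent value, and the normalized mLSTM-style read returns their average. The sketch illustrates the mechanics of accumulation versus correction only; it makes no claim about the empirical results summarized above.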