On Subquadratic Architectures: From Applications to Principles
June 10, 2026
Authors: Anamaria-Roberta Hartl, Levente Zólyomi, David Stap, Pieter-Jan Hoedt, Niklas Schmidinger, Lukas Hauzenberger, Sebastian Böck, Günter Klambauer, Sepp Hochreiter
cs.AI
Abstract
Transformers dominate modern sequence modeling, but their quadratic attention incurs substantial computational cost. Subquadratic architectures offer a scalable alternative. However, it remains unclear which designs yield the most effective sequence models. We compare three leading approaches: xLSTM, Mamba-2, and Gated DeltaNet. We evaluate these models on tasks with complex dependencies: (1) code-model pre-training, (2) distillation of code models from large language models, and (3) pre-training of time-series foundation models. Across these settings, xLSTM delivers the strongest overall performance. To explain xLSTM's advantage, we present a unified formulation and analyze the underlying architectural mechanisms, focusing on state tracking and memory dynamics. Our results show that xLSTM enables more flexible and stable memory correction via its gating scheme. We corroborate these findings on controlled synthetic length-generalization tasks. Overall, our findings indicate that xLSTM's gains on complex tasks stem from robust state tracking and accumulation.
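To make the notion of "memory dynamics" concrete, here is a minimal sketch of the kind of fixed-size, matrix-state recurrence that subquadratic models in this family are commonly described with. It is an illustration under simplifying assumptions, not the unified formulation the paper presents: the function name `run_recurrence`, the scalar gates `decay` and `write`, and the two rule labels are hypothetical, and details such as normalizer states, exponential gating, and stabilization are left out. The "additive" rule loosely corresponds to gated accumulation of key-value outer products, and the "delta" rule to the gated delta update, which first erases what is stored under the current key.

```python
import numpy as np


def run_recurrence(q, k, v, decay, write, rule="additive"):
    """Run a simplified matrix-state linear recurrence over a sequence.

    q, k: arrays of shape (T, d_k); v: array of shape (T, d_v).
    decay, write: arrays of shape (T,) holding scalar gates in [0, 1].
    Returns outputs of shape (T, d_v). The loop is linear in T and the
    state S has a fixed size, unlike attention's growing key-value cache.
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))  # memory matrix, independent of sequence length
    out = np.empty((T, d_v))
    for t in range(T):
        if rule == "delta":
            # Delta-style correction: remove the value currently read out
            # under k_t, i.e. S <- S (I - write_t * k_t k_t^T).
            S = S - write[t] * np.outer(S @ k[t], k[t])
        # Gated write shared by both rules: decay old memory, add v_t k_t^T.
        S = decay[t] * S + write[t] * np.outer(v[t], k[t])
        out[t] = S @ q[t]  # read the memory with the query
    return out


# Toy check: write two different values under the same unit-norm key.
k = np.array([[1.0, 0.0], [1.0, 0.0]])
v = np.array([[1.0, 2.0], [3.0, 4.0]])
ones = np.ones(2)  # gates fully open, no decay

add_out = run_recurrence(k, k, v, ones, ones, rule="additive")
delta_out = run_recurrence(k, k, v, ones, ones, rule="delta")
```

With all gates set to 1, the additive rule accumulates both values, so the final read returns their sum `[4, 6]`, while the delta rule overwrites the first value and returns `[3, 4]`. In practice the gates are input-dependent and learned, so the additive rule can also discount or suppress old content. The sketch makes no claim about which rule performs better on a given task.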