On Subquadratic Architectures: From Applications to Principles

Abstract

Transformers dominate modern sequence modeling, but their quadratic attention incurs substantial computational cost. Subquadratic architectures offer a scalable alternative. However, it remains unclear which designs yield the most effective sequence models. We compare three leading approaches: xLSTM, Mamba-2, and Gated DeltaNet. We evaluate these models on tasks with complex dependencies: (1) code-model pre-training, (2) distillation of code models from large language models, and (3) pre-training of time-series foundation models. Across these settings, xLSTM delivers the strongest overall performance. To explain xLSTM's advantage, we present a unified formulation and analyze the underlying architectural mechanisms, focusing on state tracking and memory dynamics. Our results show that xLSTM enables more flexible and stable memory correction via its gating scheme. We corroborate these findings on controlled synthetic length-generalization tasks. Overall, our findings indicate that xLSTM's gains on complex tasks stem from robust state tracking and accumulation.
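To make "memory correction via gating" concrete, the following is a minimal sketch of two fixed-size matrix-memory updates of the kind used in this family of models. It is not the paper's unified formulation. The function names (`gated_additive_step`, `gated_delta_step`, `read`) and the scalar gates `forget`, `write`, and `beta` are illustrative assumptions. The additive rule is roughly in the spirit of the Mamba-2 and xLSTM (mLSTM) updates, and the delta rule is in the spirit of Gated DeltaNet. Normalizers, exponential gating, multiple heads, and learned projections are all omitted.

```python
import numpy as np

def gated_additive_step(S, k, v, forget, write):
    """Additive gated update: S <- forget * S + write * outer(v, k).

    Simplified assumption: `forget` and `write` are scalars in [0, 1].
    """
    return forget * S + write * np.outer(v, k)

def gated_delta_step(S, k, v, forget, beta):
    """Gated delta-rule update: decay the memory, then replace what is
    currently stored under key k with a beta-weighted mix of old and new.

    Equivalent to forget * S @ (I - beta * k k^T) + beta * outer(v, k).
    """
    S = forget * S
    predicted = S @ k  # value the memory currently returns for k
    return S + beta * np.outer(v - predicted, k)

def read(S, q):
    """Query the memory with q (no normalization in this sketch)."""
    return S @ q

# Write two different values under the same unit-norm key.
k = np.array([1.0, 0.0, 0.0, 0.0])
v1 = np.array([1.0, 2.0])
v2 = np.array([3.0, -1.0])

S_add = np.zeros((2, 4))
S_add = gated_additive_step(S_add, k, v1, forget=1.0, write=1.0)
S_add = gated_additive_step(S_add, k, v2, forget=1.0, write=1.0)

S_delta = np.zeros((2, 4))
S_delta = gated_delta_step(S_delta, k, v1, forget=1.0, beta=1.0)
S_delta = gated_delta_step(S_delta, k, v2, forget=1.0, beta=1.0)

print(read(S_add, k))    # values accumulate: v1 + v2
print(read(S_delta, k))  # old value is overwritten: v2
```

With both gates fully open, the additive memory accumulates the two writes, while the delta rule overwrites the first. In the additive rule, lowering `forget` before the second write is the only way to correct the stored value, and it decays every key at once. The sketch only illustrates the kind of accumulation versus correction trade-off the abstract refers to and makes no claim about which behavior the evaluated models exhibit in practice.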