Repeat After Me: Transformers are Better than State Space Models at Copying

February 1, 2024
Authors: Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach
cs.AI

Abstract

Transformers are the dominant architecture for sequence modeling, but there is growing interest in models that use a fixed-size latent state that does not depend on the sequence length, which we refer to as "generalized state space models" (GSSMs). In this paper we show that while GSSMs are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks that require copying from the input context. We start with a theoretical analysis of the simple task of string copying and prove that a two layer transformer can copy strings of exponential length while GSSMs are fundamentally limited by their fixed-size latent state. Empirically, we find that transformers outperform GSSMs in terms of efficiency and generalization on synthetic tasks that require copying the context. Finally, we evaluate pretrained large language models and find that transformer models dramatically outperform state space models at copying and retrieving information from context. Taken together, these results suggest a fundamental gap between transformers and GSSMs on tasks of practical interest.
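
The claim that GSSMs are fundamentally limited by their fixed-size latent state rests on a counting argument that the abstract only gestures at. The following is a minimal LaTeX sketch of that argument; the symbols b (state size in bits), Sigma (alphabet), and n (string length) are illustrative notation introduced here, not taken from the listing, and the paper's formal statement is more refined than this.

% Sketch of the fixed-state counting bound; b, \Sigma, and n are illustrative symbols,
% not notation from the paper's listing.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
A GSSM summarizes the prefix it has read in a latent state of at most $b$ bits, so it can
occupy at most $2^{b}$ distinct states. To copy an arbitrary string $x \in \Sigma^{n}$ after
reading it, the map from strings to states must be injective on $\Sigma^{n}$: if two distinct
strings lead to the same state, the model emits the same continuation for both and
miscopies at least one of them. Hence
\[
  |\Sigma|^{n} \le 2^{b}
  \qquad\Longrightarrow\qquad
  n \le \frac{b}{\log_{2}|\Sigma|},
\]
so the copyable length grows only linearly with the state size. A transformer, whose
attention can read the stored context directly rather than through a fixed-size summary,
is not subject to this bound, which is the gap the paper formalizes.
\end{document}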