Repeat After Me: Transformers are Better than State Space Models at Copying
February 1, 2024
Authors: Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach
cs.AI
Abstract
Transformers are the dominant architecture for sequence modeling, but there
is growing interest in models that use a fixed-size latent state that does not
depend on the sequence length, which we refer to as "generalized state space
models" (GSSMs). In this paper we show that while GSSMs are promising in terms
of inference-time efficiency, they are limited compared to transformer models
on tasks that require copying from the input context. We start with a
theoretical analysis of the simple task of string copying and prove that a
two-layer transformer can copy strings of exponential length while GSSMs are
fundamentally limited by their fixed-size latent state. Empirically, we find
that transformers outperform GSSMs in terms of efficiency and generalization on
synthetic tasks that require copying the context. Finally, we evaluate
pretrained large language models and find that transformer models dramatically
outperform state space models at copying and retrieving information from
context. Taken together, these results suggest a fundamental gap between
transformers and GSSMs on tasks of practical interest.
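
To make the copying task concrete, below is a minimal sketch (our own illustration, not code from the paper) of the kind of synthetic copy benchmark the abstract describes: the model is shown a random string over a small alphabet followed by a separator token and must reproduce the string verbatim. The token names (BOS, SEP), the alphabet, and the helper make_copy_example are hypothetical. The intuition behind the theoretical limitation is a pigeonhole argument: a state of b bits can distinguish at most 2^b prefixes, so a fixed-state model cannot reliably copy arbitrary strings much longer than about b / log2(|alphabet|) tokens, while a transformer can attend back to the full input while generating.

import random

VOCAB = list("abcdefghijklmnopqrstuvwxyz")  # small alphabet for the random strings
BOS, SEP = "<bos>", "<copy>"                # hypothetical special tokens

def make_copy_example(length, rng):
    """One instance: prompt = <bos> x_1 ... x_L <copy>, target = x_1 ... x_L."""
    s = [rng.choice(VOCAB) for _ in range(length)]
    return [BOS] + s + [SEP], s

rng = random.Random(0)
prompt, target = make_copy_example(20, rng)
print(" ".join(prompt))   # what the model sees
print(" ".join(target))   # what it must reproduce, token by token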