Repeat After Me: Transformers are Better than State Space Models at Copying
February 1, 2024
Authors: Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach
cs.AI
Abstract
Transformers are the dominant architecture for sequence modeling, but there
is growing interest in models that use a fixed-size latent state that does not
depend on the sequence length, which we refer to as "generalized state space
models" (GSSMs). In this paper we show that while GSSMs are promising in terms
of inference-time efficiency, they are limited compared to transformer models
on tasks that require copying from the input context. We start with a
theoretical analysis of the simple task of string copying and prove that a
two-layer transformer can copy strings of exponential length while GSSMs are
fundamentally limited by their fixed-size latent state. Empirically, we find
that transformers outperform GSSMs in terms of efficiency and generalization on
synthetic tasks that require copying the context. Finally, we evaluate
pretrained large language models and find that transformer models dramatically
outperform state space models at copying and retrieving information from
context. Taken together, these results suggest a fundamental gap between
transformers and GSSMs on tasks of practical interest.
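
To make the copying task concrete, below is a minimal sketch (our own illustration, not code from the paper) of the kind of synthetic copy benchmark the abstract describes: the model is shown a random string over a small alphabet followed by a separator token and must reproduce the string verbatim. The token names (BOS, SEP), the alphabet, and the helper make_copy_example are hypothetical. The intuition behind the theoretical limitation is a pigeonhole argument: a state of b bits can distinguish at most 2^b prefixes, so a fixed-state model cannot reliably copy arbitrary strings much longer than about b / log2(|alphabet|) tokens, while a transformer can attend back to the full input while generating.

import random

VOCAB = list("abcdefghijklmnopqrstuvwxyz")  # small alphabet for the random strings
BOS, SEP = "<bos>", "<copy>"                # hypothetical special tokens

def make_copy_example(length, rng):
    """One instance: prompt = <bos> x_1 ... x_L <copy>, target = x_1 ... x_L."""
    s = [rng.choice(VOCAB) for _ in range(length)]
    return [BOS] + s + [SEP], s

rng = random.Random(0)
prompt, target = make_copy_example(20, rng)
print(" ".join(prompt))   # what the model sees
print(" ".join(target))   # what it must reproduce, token by token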