
On the "Induction Bias" in Sequence Models

February 20, 2026
作者: M. Reza Ebrahimi, Michaël Defferrard, Sunny Panchal, Roland Memisevic
cs.AI

Abstract
Despite the remarkable practical success of transformer-based language models, recent work has raised concerns about their ability to perform state tracking. In particular, a growing body of literature has shown this limitation primarily through failures in out-of-distribution (OOD) generalization, such as length extrapolation. In this work, we shift attention to the in-distribution implications of these limitations. We conduct a large-scale experimental study of the data efficiency of transformers and recurrent neural networks (RNNs) across multiple supervision regimes. We find that the amount of training data required by transformers grows much more rapidly with state-space size and sequence length than for RNNs. Furthermore, we analyze the extent to which learned state-tracking mechanisms are shared across different sequence lengths. We show that transformers exhibit negligible or even detrimental weight sharing across lengths, indicating that they learn length-specific solutions in isolation. In contrast, recurrent models exhibit effective amortized learning by sharing weights across lengths, allowing data from one sequence length to improve performance on others. Together, these results demonstrate that state tracking remains a fundamental challenge for transformers, even when training and evaluation distributions match.
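To make the notion of state tracking concrete, the sketch below shows one task of the kind typically used in this line of work: predicting a running group product over a token sequence. The specific task (a running sum modulo k) and the helper name `make_example` are illustrative assumptions, not necessarily the benchmark used in the paper; the point is that a recurrent update maintains the state in O(1) memory per step, whereas a transformer must recover it from the full prefix.

```python
# Hedged sketch of a state-tracking task: the running sum modulo k.
# The task and helper name are illustrative assumptions, not the
# paper's exact benchmark.
import random


def make_example(length, k=5, seed=0):
    """Return (tokens, targets) where targets[i] is the running
    sum of tokens[:i+1] modulo k -- the 'state' to be tracked."""
    rng = random.Random(seed)
    tokens = [rng.randrange(k) for _ in range(length)]
    state, targets = 0, []
    for t in tokens:
        state = (state + t) % k  # recurrent update: O(1) per step
        targets.append(state)
    return tokens, targets


tokens, targets = make_example(8)
```

An RNN can represent this update directly in its hidden state, so weights learned at one sequence length transfer to others; a transformer, lacking a recurrent state, must learn length-specific circuits to recover `targets[i]` from the prefix `tokens[:i+1]`.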