On the "Inductive Bias" in Sequence Models
February 20, 2026
Authors: M. Reza Ebrahimi, Michaël Defferrard, Sunny Panchal, Roland Memisevic
cs.AI
Abstract
Despite the remarkable practical success of transformer-based language models, recent work has raised concerns about their ability to perform state tracking. In particular, a growing body of literature has shown this limitation primarily through failures in out-of-distribution (OOD) generalization, such as length extrapolation. In this work, we shift attention to the in-distribution implications of these limitations. We conduct a large-scale experimental study of the data efficiency of transformers and recurrent neural networks (RNNs) across multiple supervision regimes. We find that the amount of training data required by transformers grows much more rapidly with state-space size and sequence length than for RNNs. Furthermore, we analyze the extent to which learned state-tracking mechanisms are shared across different sequence lengths. We show that transformers exhibit negligible or even detrimental weight sharing across lengths, indicating that they learn length-specific solutions in isolation. In contrast, recurrent models exhibit effective amortized learning by sharing weights across lengths, allowing data from one sequence length to improve performance on others. Together, these results demonstrate that state tracking remains a fundamental challenge for transformers, even when training and evaluation distributions match.
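To make the notion of state tracking concrete, a minimal illustrative sketch follows. It is not taken from the paper: a canonical family of state-tracking tasks asks a model to read a sequence of group elements (here, permutations) and predict the composed state after each step. Short-length examples constrain the same recurrence that governs long sequences, which is the kind of cross-length sharing the abstract attributes to recurrent models. The helper names (`compose`, `make_example`) are hypothetical.

```python
# Hedged sketch of a permutation-composition state-tracking task
# (illustrative only; not the dataset used in the paper).
import itertools
import random

def compose(p, q):
    """Compose two permutations given as tuples: apply p, then q."""
    # (p followed by q)[i] = q[p[i]]
    return tuple(q[i] for i in p)

def make_example(perms, length, rng):
    """Sample a sequence of permutations and the running composed state.

    A sequence model trained on (seq -> states) must track the group
    state; the target at each position depends on the entire prefix.
    """
    seq = [rng.choice(perms) for _ in range(length)]
    state = tuple(range(len(perms[0])))  # identity permutation
    states = []
    for p in seq:
        state = compose(state, p)
        states.append(state)
    return seq, states

# Example: all permutations of 3 elements (the group S3).
perms = list(itertools.permutations(range(3)))
seq, states = make_example(perms, 5, random.Random(0))
```

An RNN can, in principle, solve every length with one transition function applied step by step, whereas a transformer must learn how to aggregate each prefix within its fixed attention depth, which is one intuition for the length-specific solutions the study reports.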