Transformers Can Achieve Length Generalization But Not Robustly
February 14, 2024
Authors: Yongchao Zhou, Uri Alon, Xinyun Chen, Xuezhi Wang, Rishabh Agarwal, Denny Zhou
cs.AI
Abstract
Length generalization, defined as the ability to extrapolate from shorter training sequences to longer test ones, is a significant challenge for language models. This issue persists even with large-scale Transformers handling relatively straightforward tasks. In this paper, we test the Transformer's ability of length generalization using the task of addition of two integers. We show that the success of length generalization is intricately linked to the data format and the type of position encoding. Using the right combination of data format and position encodings, we show for the first time that standard Transformers can extrapolate to a sequence length that is 2.5x the input length. Nevertheless, unlike in-distribution generalization, length generalization remains fragile, significantly influenced by factors like random weight initialization and training data order, leading to large variances across different random seeds.
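To make the role of "data format" in this task concrete, the minimal sketch below generates two-integer addition examples in a digit-reversed text format, one common formatting choice studied in length-generalization work on arithmetic. The function names and the specific format here are illustrative assumptions, not the exact format or position-encoding setup used in the paper.

```python
import random

def format_addition_example(a: int, b: int, reverse_digits: bool = True) -> str:
    """Format one two-integer addition example as a text sequence.

    With reverse_digits=True, operands and the answer are written
    least-significant digit first, a formatting trick that lets a
    left-to-right model emit each answer digit from digits it has
    already seen. (Illustrative choice, not necessarily the paper's.)
    """
    answer = a + b
    fmt = (lambda n: str(n)[::-1]) if reverse_digits else str
    return f"{fmt(a)}+{fmt(b)}={fmt(answer)}"

def make_dataset(num_examples: int, max_digits: int, seed: int = 0) -> list[str]:
    """Sample addition problems with operands of up to max_digits digits."""
    rng = random.Random(seed)
    examples = []
    for _ in range(num_examples):
        da = rng.randint(1, max_digits)
        db = rng.randint(1, max_digits)
        a = rng.randint(10 ** (da - 1), 10 ** da - 1)
        b = rng.randint(10 ** (db - 1), 10 ** db - 1)
        examples.append(format_addition_example(a, b))
    return examples

if __name__ == "__main__":
    # Train on short problems (e.g. up to 10 digits) and evaluate length
    # generalization on longer ones (e.g. up to 25 digits, roughly 2.5x).
    train_examples = make_dataset(num_examples=5, max_digits=10)
    test_examples = make_dataset(num_examples=5, max_digits=25, seed=1)
    print("\n".join(train_examples))
    print("\n".join(test_examples))
```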