Transformers Can Achieve Length Generalization But Not Robustly
February 14, 2024
Authors: Yongchao Zhou, Uri Alon, Xinyun Chen, Xuezhi Wang, Rishabh Agarwal, Denny Zhou
cs.AI
Abstract
Length generalization, defined as the ability to extrapolate from shorter training sequences to longer test ones, is a significant challenge for language models. This issue persists even with large-scale Transformers handling relatively straightforward tasks. In this paper, we test the Transformer's ability of length generalization using the task of addition of two integers. We show that the success of length generalization is intricately linked to the data format and the type of position encoding. Using the right combination of data format and position encodings, we show for the first time that standard Transformers can extrapolate to a sequence length that is 2.5x the input length. Nevertheless, unlike in-distribution generalization, length generalization remains fragile, significantly influenced by factors like random weight initialization and training data order, leading to large variances across different random seeds.
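To make the role of "data format" in this task concrete, the minimal sketch below generates two-integer addition examples in a digit-reversed text format, one common formatting choice studied in length-generalization work on arithmetic. The function names and the specific format here are illustrative assumptions, not the exact format or position-encoding setup used in the paper.

```python
import random

def format_addition_example(a: int, b: int, reverse_digits: bool = True) -> str:
    """Format one two-integer addition example as a text sequence.

    With reverse_digits=True, operands and the answer are written
    least-significant digit first, a formatting trick that lets a
    left-to-right model emit each answer digit from digits it has
    already seen. (Illustrative choice, not necessarily the paper's.)
    """
    answer = a + b
    fmt = (lambda n: str(n)[::-1]) if reverse_digits else str
    return f"{fmt(a)}+{fmt(b)}={fmt(answer)}"

def make_dataset(num_examples: int, max_digits: int, seed: int = 0) -> list[str]:
    """Sample addition problems with operands of up to max_digits digits."""
    rng = random.Random(seed)
    examples = []
    for _ in range(num_examples):
        da = rng.randint(1, max_digits)
        db = rng.randint(1, max_digits)
        a = rng.randint(10 ** (da - 1), 10 ** da - 1)
        b = rng.randint(10 ** (db - 1), 10 ** db - 1)
        examples.append(format_addition_example(a, b))
    return examples

if __name__ == "__main__":
    # Train on short problems (e.g. up to 10 digits) and evaluate length
    # generalization on longer ones (e.g. up to 25 digits, roughly 2.5x).
    train_examples = make_dataset(num_examples=5, max_digits=10)
    test_examples = make_dataset(num_examples=5, max_digits=25, seed=1)
    print("\n".join(train_examples))
    print("\n".join(test_examples))
```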