Transformers Can Achieve Length Generalization But Not Robustly
February 14, 2024
Authors: Yongchao Zhou, Uri Alon, Xinyun Chen, Xuezhi Wang, Rishabh Agarwal, Denny Zhou
cs.AI
Abstract
Length generalization, defined as the ability to extrapolate from shorter
training sequences to longer test ones, is a significant challenge for language
models. This issue persists even with large-scale Transformers handling
relatively straightforward tasks. In this paper, we test the Transformer's
length-generalization ability using the task of adding two integers. We
show that the success of length generalization is intricately linked to the
data format and the type of position encoding. Using the right combination of
data format and position encodings, we show for the first time that standard
Transformers can extrapolate to a sequence length that is 2.5x the input
length. Nevertheless, unlike in-distribution generalization, length
generalization remains fragile, significantly influenced by factors like random
weight initialization and training data order, leading to large variances
across different random seeds.
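To make the task concrete, below is a minimal Python sketch of how two-integer addition examples might be generated for a length-generalization probe. The reversed-answer format is one plausible instance of the "data format" choices the abstract refers to; the exact formatting, digit ranges, and position-encoding setup used in the paper may differ.

```python
import random

def make_addition_example(max_digits, reverse_answer=True):
    """Build one two-integer addition example as a text sequence.

    reverse_answer=True emits the sum least-significant digit first, one
    common data-format trick for digit-level addition; this is illustrative,
    not necessarily the paper's exact format.
    """
    n_digits = random.randint(1, max_digits)
    a = random.randint(0, 10 ** n_digits - 1)
    b = random.randint(0, 10 ** n_digits - 1)
    answer = str(a + b)
    if reverse_answer:
        answer = answer[::-1]  # least-significant digit first
    return f"{a}+{b}={answer}"

# Train on short operands, then probe on roughly 2.5x longer ones,
# mirroring the extrapolation ratio reported in the abstract
# (the 10/25 digit split here is only illustrative).
train_example = make_addition_example(max_digits=10)
test_example = make_addition_example(max_digits=25)
print(train_example)
print(test_example)
```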