Transformers Can Achieve Length Generalization But Not Robustly
February 14, 2024
Authors: Yongchao Zhou, Uri Alon, Xinyun Chen, Xuezhi Wang, Rishabh Agarwal, Denny Zhou
cs.AI
Abstract
Length generalization, defined as the ability to extrapolate from shorter
training sequences to longer test ones, is a significant challenge for language
models. This issue persists even with large-scale Transformers handling
relatively straightforward tasks. In this paper, we test the Transformer's
length-generalization ability using the task of adding two integers. We
show that the success of length generalization is intricately linked to the
data format and the type of position encoding. Using the right combination of
data format and position encodings, we show for the first time that standard
Transformers can extrapolate to a sequence length that is 2.5x the input
length. Nevertheless, unlike in-distribution generalization, length
generalization remains fragile, significantly influenced by factors like random
weight initialization and training data order, leading to large variances
across different random seeds.
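To make the task concrete, below is a minimal Python sketch of how two-integer addition examples might be generated for a length-generalization probe. The reversed-answer format is one plausible instance of the "data format" choices the abstract refers to; the exact formatting, digit ranges, and position-encoding setup used in the paper may differ.

```python
import random

def make_addition_example(max_digits, reverse_answer=True):
    """Build one two-integer addition example as a text sequence.

    reverse_answer=True emits the sum least-significant digit first, one
    common data-format trick for digit-level addition; this is illustrative,
    not necessarily the paper's exact format.
    """
    n_digits = random.randint(1, max_digits)
    a = random.randint(0, 10 ** n_digits - 1)
    b = random.randint(0, 10 ** n_digits - 1)
    answer = str(a + b)
    if reverse_answer:
        answer = answer[::-1]  # least-significant digit first
    return f"{a}+{b}={answer}"

# Train on short operands, then probe on roughly 2.5x longer ones,
# mirroring the extrapolation ratio reported in the abstract
# (the 10/25 digit split here is only illustrative).
train_example = make_addition_example(max_digits=10)
test_example = make_addition_example(max_digits=25)
print(train_example)
print(test_example)
```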