Transformers Can Achieve Length Generalization But Not Robustly

Abstract

Length generalization, defined as the ability to extrapolate from shorter training sequences to longer test ones, is a significant challenge for language models. This issue persists even with large-scale Transformers handling relatively straightforward tasks. In this paper, we test the Transformer's length generalization ability on the task of adding two integers. We show that the success of length generalization is intricately linked to the data format and the type of position encoding. Using the right combination of data format and position encodings, we show for the first time that standard Transformers can extrapolate to a sequence length that is 2.5x the input length. Nevertheless, unlike in-distribution generalization, length generalization remains fragile: it is significantly influenced by factors such as random weight initialization and training data order, leading to large variance across random seeds.

