Randomized Positional Encodings Boost Length Generalization of Transformers

Abstract

Transformers have impressive generalization capabilities on tasks with a fixed context length. However, they fail to generalize to sequences of arbitrary length, even on seemingly simple tasks such as duplicating a string. Moreover, simply training on longer sequences is inefficient because of the quadratic computational complexity of the global attention mechanism. In this work, we demonstrate that this failure mode is linked to positional encodings being out-of-distribution for longer sequences (even for relative encodings), and we introduce a novel family of positional encodings that can overcome this problem. Concretely, our randomized positional encoding scheme simulates the positions of longer sequences and randomly selects an ordered subset to fit the sequence's length. Our large-scale empirical evaluation of 6000 models across 15 algorithmic reasoning tasks shows that our method allows Transformers to generalize to sequences of unseen length (increasing test accuracy by 12.0% on average).
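
The position-sampling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sample_randomized_positions` and the parameter `max_position` (the simulated maximum length, assumed to be at least as large as any test-time sequence length) are hypothetical names chosen for illustration. The sketch only shows how an ordered subset of positions might be drawn. The sampled indices would then stand in for the usual consecutive positions when computing whichever positional encoding the model uses.

```python
import random


def sample_randomized_positions(seq_len, max_position, rng=random):
    """Draw seq_len distinct positions from range(max_position), in increasing order.

    Hypothetical helper: max_position is an assumed upper bound on the
    sequence lengths the model should eventually handle.
    """
    if seq_len > max_position:
        raise ValueError("max_position must be at least seq_len")
    # Sample without replacement, then sort so the order of tokens is preserved.
    return sorted(rng.sample(range(max_position), seq_len))


# Example: a length-5 training sequence receives positions drawn from 0..19
# instead of the usual 0..4.
rng = random.Random(0)
positions = sample_randomized_positions(seq_len=5, max_position=20, rng=rng)
print(positions)
```

Because the subset is sorted, the relative order of tokens is unchanged, while the model is exposed during training to position values from the full simulated range rather than only those of short sequences.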

