Randomized Positional Encodings Boost Length Generalization of Transformers
May 26, 2023
Authors: Anian Ruoss, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Róbert Csordás, Mehdi Bennani, Shane Legg, Joel Veness
cs.AI
Abstract
Transformers have impressive generalization capabilities on tasks with a
fixed context length. However, they fail to generalize to sequences of
arbitrary length, even for seemingly simple tasks such as duplicating a string.
Moreover, simply training on longer sequences is inefficient due to the
quadratic computational complexity of the global attention mechanism. In this
work, we demonstrate that this failure mode is linked to positional encodings
being out-of-distribution for longer sequences (even for relative encodings)
and introduce a novel family of positional encodings that can overcome this
problem. Concretely, our randomized positional encoding scheme simulates the
positions of longer sequences and randomly selects an ordered subset to fit the
sequence's length. Our large-scale empirical evaluation of 6000 models across
15 algorithmic reasoning tasks shows that our method allows Transformers to
generalize to sequences of unseen length (increasing test accuracy by 12.0% on
average).
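
The sampling step described in the abstract (simulate the positions of a longer sequence, then keep a random ordered subset that fits the actual sequence length) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name sample_randomized_positions, the parameter max_len (the simulated maximum length), and the example values are all hypothetical.

```python
import random


def sample_randomized_positions(seq_len, max_len, rng=None):
    """Return seq_len distinct positions from range(max_len), in ascending order.

    Hypothetical helper: max_len stands for the length of the simulated
    longer sequence and is assumed to be at least seq_len.
    """
    if seq_len > max_len:
        raise ValueError("seq_len must not exceed max_len")
    if rng is None:
        rng = random.Random()
    # Sample without replacement, then sort so that the relative order
    # of the tokens is preserved.
    return sorted(rng.sample(range(max_len), seq_len))


# Assumed example values: a sequence of 5 tokens, simulated maximum length 20.
positions = sample_randomized_positions(seq_len=5, max_len=20, rng=random.Random(0))
print(positions)  # 5 increasing indices drawn from 0..19
```

In such a setup the sampled indices would replace the usual 0, 1, ..., seq_len - 1 as input to the positional encoding. If max_len is chosen to be at least as large as the longest test sequence, the position values seen at test time stay within the range already encountered during training.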