
Randomized Positional Encodings Boost Length Generalization of Transformers

May 26, 2023
作者: Anian Ruoss, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Róbert Csordás, Mehdi Bennani, Shane Legg, Joel Veness
cs.AI

Abstract

Transformers have impressive generalization capabilities on tasks with a fixed context length. However, they fail to generalize to sequences of arbitrary length, even for seemingly simple tasks such as duplicating a string. Moreover, simply training on longer sequences is inefficient due to the quadratic computation complexity of the global attention mechanism. In this work, we demonstrate that this failure mode is linked to positional encodings being out-of-distribution for longer sequences (even for relative encodings) and introduce a novel family of positional encodings that can overcome this problem. Concretely, our randomized positional encoding scheme simulates the positions of longer sequences and randomly selects an ordered subset to fit the sequence's length. Our large-scale empirical evaluation of 6000 models across 15 algorithmic reasoning tasks shows that our method allows Transformers to generalize to sequences of unseen length (increasing test accuracy by 12.0% on average).
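For intuition, here is a minimal sketch of the core idea in PyTorch (illustrative only, not the authors' implementation; `max_len`, `d_model`, and the learned-embedding setup are assumptions): positions are sampled without replacement from a range much larger than any training length, sorted so their order is preserved, and used in place of the usual `0..seq_len-1` indices.

```python
import torch
import torch.nn as nn

def randomized_positions(seq_len: int, max_len: int) -> torch.Tensor:
    """Sample an ordered subset of `seq_len` positions from [0, max_len)."""
    assert seq_len <= max_len
    positions, _ = torch.sort(torch.randperm(max_len)[:seq_len])
    return positions

# Usage sketch: look up positional embeddings at the sampled positions
# instead of at 0..seq_len-1 (hyperparameters below are hypothetical).
max_len, d_model = 2048, 64
pos_emb = nn.Embedding(max_len, d_model)

x = torch.randn(1, 40, d_model)                           # one sequence of length 40
pe = pos_emb(randomized_positions(x.shape[1], max_len))   # (40, d_model)
h = x + pe                                                 # add positional information as usual
```

Because test-time positions are drawn from the same [0, max_len) range seen during training, the positional encodings for longer sequences stay in-distribution, which is the failure mode the paper identifies for standard (absolute and relative) encodings.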