

Length Generalization in Arithmetic Transformers

June 27, 2023
Authors: Samy Jelassi, Stéphane d'Ascoli, Carles Domingo-Enrich, Yuhuai Wu, Yuanzhi Li, François Charton
cs.AI

Abstract

We examine how transformers cope with two challenges: learning basic integer arithmetic, and generalizing to longer sequences than seen during training. We find that relative position embeddings enable length generalization for simple tasks, such as addition: models trained on 5-digit numbers can perform 15-digit sums. However, this method fails for multiplication, and we propose train set priming: adding a few (10 to 50) long sequences to the training set. We show that priming allows models trained on 5-digit times 3-digit multiplications to generalize to 35 × 3 examples. We also show that models can be primed for different generalization lengths, and that the priming sample size scales as the logarithm of the training set size. Finally, we discuss potential applications of priming beyond arithmetic.
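The train set priming recipe described in the abstract amounts to a data-construction step: build a large set of short examples and mix in a handful of long ones. Below is a minimal sketch in Python, assuming a plain text-to-text encoding of multiplication examples; the function names, the 100,000-example base set, and the 25 priming examples are illustrative choices, not the paper's exact configuration.

```python
import random

def multiplication_example(n_digits_a: int, n_digits_b: int) -> str:
    """Build one 'a * b = c' training string with operands of the given digit lengths."""
    a = random.randint(10 ** (n_digits_a - 1), 10 ** n_digits_a - 1)
    b = random.randint(10 ** (n_digits_b - 1), 10 ** n_digits_b - 1)
    return f"{a} * {b} = {a * b}"

def primed_train_set(n_base: int = 100_000, n_primes: int = 25) -> list[str]:
    """Train set priming: mostly short (5-digit x 3-digit) examples,
    plus a few long (35-digit x 3-digit) 'priming' examples."""
    base = [multiplication_example(5, 3) for _ in range(n_base)]
    primes = [multiplication_example(35, 3) for _ in range(n_primes)]  # the ~10-50 long sequences
    data = base + primes
    random.shuffle(data)
    return data

if __name__ == "__main__":
    data = primed_train_set()
    print(len(data), data[0])
```

The key design point is the ratio: the number of long examples stays tiny relative to the base set (per the abstract, it scales only logarithmically with training set size), yet it is enough to unlock generalization to the longer operand length.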