Length Generalization in Arithmetic Transformers

Abstract

We examine how transformers cope with two challenges: learning basic integer arithmetic, and generalizing to sequences longer than those seen during training. We find that relative position embeddings enable length generalization for simple tasks such as addition: models trained on 5-digit numbers can perform 15-digit sums. However, this method fails for multiplication, and we propose train set priming: adding a few (10 to 50) long sequences to the training set. We show that priming allows models trained on 5-digit × 3-digit multiplications to generalize to 35 × 3 examples. We also show that models can be primed for different generalization lengths, and that the priming sample size scales as the logarithm of the training set size. Finally, we discuss potential applications of priming beyond arithmetic.
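To make the idea of train set priming concrete, the following is a minimal sketch of how such a training set might be assembled. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the `"a*b"` string format, the use of operands with exactly 5, 3, and 35 digits, and the final shuffle are all hypothetical choices, since the abstract does not specify how examples are sampled or encoded.

```python
import random


def random_operand(n_digits, rng):
    """Return a uniformly random integer with exactly n_digits digits."""
    low = 10 ** (n_digits - 1)
    return rng.randint(low, 10 ** n_digits - 1)


def make_example(len_a, len_b, rng):
    """Build one (problem, answer) string pair for a len_a x len_b multiplication."""
    a = random_operand(len_a, rng)
    b = random_operand(len_b, rng)
    # The "a*b" text encoding is an assumption made for this sketch.
    return f"{a}*{b}", str(a * b)


def build_primed_train_set(n_base, n_priming, base_len=5, priming_len=35,
                           other_len=3, seed=0):
    """Mix n_base short examples with a handful of long priming examples."""
    rng = random.Random(seed)
    # Bulk of the data: short multiplications (5-digit by 3-digit by default).
    base = [make_example(base_len, other_len, rng) for _ in range(n_base)]
    # Priming: a few long examples (35-digit by 3-digit by default).
    priming = [make_example(priming_len, other_len, rng) for _ in range(n_priming)]
    train = base + priming
    rng.shuffle(train)  # interleave priming examples with the base data
    return train
```

For instance, `build_primed_train_set(n_base=5000, n_priming=50)` would return 5,050 examples, of which only 50 involve a 35-digit operand; the base size of 5,000 is an arbitrary value chosen for the illustration.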
