Teaching Arithmetic to Small Transformers
July 7, 2023
Authors: Nayoung Lee, Kartik Sreenivasan, Jason D. Lee, Kangwook Lee, Dimitris Papailiopoulos
cs.AI
Abstract
Large language models like GPT-4 exhibit emergent capabilities across
general-purpose tasks, such as basic arithmetic, when trained on extensive text
data, even though these tasks are not explicitly encoded by the unsupervised,
next-token prediction objective. This study investigates how small
transformers, trained from random initialization, can efficiently learn
arithmetic operations such as addition, multiplication, and elementary
functions like square root, using the next-token prediction objective. We first
demonstrate that conventional training data is not the most effective for
arithmetic learning, and simple formatting changes can significantly improve
accuracy. This leads to sharp phase transitions as a function of training data
scale, which, in some cases, can be explained through connections to low-rank
matrix completion. Building on prior work, we then train on chain-of-thought
style data that includes intermediate step results. Even in the complete
absence of pretraining, this approach significantly and simultaneously improves
accuracy, sample complexity, and convergence speed. We also study the interplay
between arithmetic and text data during training and examine the effects of
few-shot prompting, pretraining, and model scale. Additionally, we discuss
length generalization challenges. Our work highlights the importance of
high-quality, instructive data that considers the particular characteristics of
the next-word prediction objective for rapidly eliciting arithmetic
capabilities.
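
The abstract does not spell out the exact data formats the authors use, so the sketch below is a rough illustration only: three hypothetical ways an addition sample could be serialized as a string for next-token prediction training. It shows a plain format, a reversed-answer format (one example of the kind of simple formatting change the abstract alludes to), and a chain-of-thought style format that exposes per-digit partial sums and carries as intermediate results. The function names and string layouts are assumptions for illustration, not the paper's actual data pipeline.

```python
# Minimal sketch of three hypothetical training-sample formats for addition,
# serialized as strings for next-token prediction. These layouts are
# illustrative assumptions, not the paper's actual data pipeline.


def format_plain(a: int, b: int) -> str:
    # Conventional layout, e.g. "352+481=833".
    return f"{a}+{b}={a + b}"


def format_reverse(a: int, b: int) -> str:
    # Same sum, but the answer is written least-significant digit first
    # ("352+481=338"), matching the order in which carries are produced.
    return f"{a}+{b}={str(a + b)[::-1]}"


def format_cot(a: int, b: int) -> str:
    # Chain-of-thought style: spell out digit-by-digit partial sums and
    # carries as intermediate steps before the final answer.
    da, db = str(a)[::-1], str(b)[::-1]
    steps, carry = [], 0
    for i in range(max(len(da), len(db))):
        x = int(da[i]) if i < len(da) else 0
        y = int(db[i]) if i < len(db) else 0
        s = x + y + carry
        steps.append(f"{x}+{y}+{carry}={s % 10} carry {s // 10}")
        carry = s // 10
    return f"{a}+{b}: " + " , ".join(steps) + f" , answer {a + b}"


if __name__ == "__main__":
    for fmt in (format_plain, format_reverse, format_cot):
        print(f"{fmt.__name__}: {fmt(352, 481)}")
```

Running the script prints one sample per format; the chain-of-thought variant is the longest, which reflects the trade-off the abstract describes: more tokens per example in exchange for better accuracy, sample complexity, and convergence speed.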