

Teaching Arithmetic to Small Transformers

July 7, 2023
Authors: Nayoung Lee, Kartik Sreenivasan, Jason D. Lee, Kangwook Lee, Dimitris Papailiopoulos
cs.AI

Abstract

Large language models like GPT-4 exhibit emergent capabilities across general-purpose tasks, such as basic arithmetic, when trained on extensive text data, even though these tasks are not explicitly encoded by the unsupervised, next-token prediction objective. This study investigates how small transformers, trained from random initialization, can efficiently learn arithmetic operations such as addition, multiplication, and elementary functions like square root, using the next-token prediction objective. We first demonstrate that conventional training data is not the most effective for arithmetic learning, and simple formatting changes can significantly improve accuracy. This leads to sharp phase transitions as a function of training data scale, which, in some cases, can be explained through connections to low-rank matrix completion. Building on prior work, we then train on chain-of-thought style data that includes intermediate step results. Even in the complete absence of pretraining, this approach significantly and simultaneously improves accuracy, sample complexity, and convergence speed. We also study the interplay between arithmetic and text data during training and examine the effects of few-shot prompting, pretraining, and model scale. Additionally, we discuss length generalization challenges. Our work highlights the importance of high-quality, instructive data that considers the particular characteristics of the next-word prediction objective for rapidly eliciting arithmetic capabilities.
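The formatting and chain-of-thought claims above are concrete enough to sketch. The snippet below is a minimal illustration of how a single addition example might be serialized into next-token-prediction training strings under three data formats of the kind the abstract describes: conventional left-to-right, answer-reversed, and a scratchpad-style sample with intermediate carry steps. The function names and exact string layouts are our own illustrative choices, not the paper's released code.

```python
# Illustrative sketch only: three ways one addition example could be
# serialized into a next-token-prediction training string. Format names
# and string layouts are hypothetical, not taken from the paper's code.

def plain_format(a: int, b: int) -> str:
    # Conventional format: the answer is written most-significant digit first.
    return f"{a}+{b}={a + b}"

def reverse_format(a: int, b: int) -> str:
    # Answer digits reversed (least-significant first), so each output
    # digit depends only on digits and carries already generated.
    return f"{a}+{b}={str(a + b)[::-1]}"

def scratchpad_format(a: int, b: int) -> str:
    # Chain-of-thought-style sample: spell out the per-digit additions
    # and carries as intermediate results before the final answer.
    da, db = str(a)[::-1], str(b)[::-1]
    steps, carry = [], 0
    for i in range(max(len(da), len(db))):
        x = int(da[i]) if i < len(da) else 0
        y = int(db[i]) if i < len(db) else 0
        total = x + y + carry
        steps.append(f"{x}+{y}+{carry}={total} -> digit {total % 10}, carry {total // 10}")
        carry = total // 10
    return f"{a}+{b}: " + "; ".join(steps) + f"; answer {a + b}"

if __name__ == "__main__":
    for fmt in (plain_format, reverse_format, scratchpad_format):
        print(fmt(123, 489))
```

One plausible reading of why reversal helps: next-token prediction generates left to right, so writing the units digit first lets the model resolve carries in the order they arise rather than having to anticipate them.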
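The abstract's reference to low-rank matrix completion also admits a short worked illustration; the framing below is one reading of that connection, not text from the paper. The full addition table over operands 0 through N-1 is a rank-2 matrix, so recovering all N^2 sums from a small random subset of observed entries is a low-rank matrix completion problem, a setting known to exhibit sharp sample-complexity thresholds.

```latex
% The N x N addition table M_{ij} = i + j factors as a sum of two
% rank-1 terms, so learning addition from random (i, j, i+j) samples
% resembles completing a rank-2 matrix from a subset of its entries.
\[
M = \begin{pmatrix} 0 \\ 1 \\ \vdots \\ N-1 \end{pmatrix} \mathbf{1}^{\top}
  + \mathbf{1} \begin{pmatrix} 0 & 1 & \cdots & N-1 \end{pmatrix},
\qquad \operatorname{rank}(M) \le 2 .
\]
```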