
Teaching Arithmetic to Small Transformers

July 7, 2023
Authors: Nayoung Lee, Kartik Sreenivasan, Jason D. Lee, Kangwook Lee, Dimitris Papailiopoulos
cs.AI

Abstract

Large language models like GPT-4 exhibit emergent capabilities across general-purpose tasks, such as basic arithmetic, when trained on extensive text data, even though these tasks are not explicitly encoded by the unsupervised, next-token prediction objective. This study investigates how small transformers, trained from random initialization, can efficiently learn arithmetic operations such as addition, multiplication, and elementary functions like square root, using the next-token prediction objective. We first demonstrate that conventional training data is not the most effective for arithmetic learning, and simple formatting changes can significantly improve accuracy. This leads to sharp phase transitions as a function of training data scale, which, in some cases, can be explained through connections to low-rank matrix completion. Building on prior work, we then train on chain-of-thought style data that includes intermediate step results. Even in the complete absence of pretraining, this approach significantly and simultaneously improves accuracy, sample complexity, and convergence speed. We also study the interplay between arithmetic and text data during training and examine the effects of few-shot prompting, pretraining, and model scale. Additionally, we discuss length generalization challenges. Our work highlights the importance of high-quality, instructive data that considers the particular characteristics of the next-word prediction objective for rapidly eliciting arithmetic capabilities.
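To make the formatting ideas above concrete, the sketch below shows how a single addition sample might be rendered as a training string in three ways: the conventional format, a reversed-output format (one example of the kind of simple formatting change the abstract refers to), and a chain-of-thought style format that spells out intermediate digit-and-carry steps. This is a minimal, hypothetical Python illustration, not the authors' released code, and the exact string templates are assumptions.

```python
# Hypothetical renderings of addition samples as next-token-prediction
# training strings; templates are illustrative, not the paper's exact format.

def plain_format(a: int, b: int) -> str:
    # Conventional format: the model must emit the most significant digit
    # of the sum first, before any carry information is available.
    return f"{a}+{b}={a + b}"

def reversed_format(a: int, b: int) -> str:
    # Simple formatting change: write the sum least-significant digit first,
    # matching the order in which carries are actually computed.
    return f"{a}+{b}={str(a + b)[::-1]}"

def scratchpad_format(a: int, b: int) -> str:
    # Chain-of-thought style data: include intermediate digit/carry steps,
    # so each predicted token depends only on a short local computation.
    xs, ys = str(a)[::-1], str(b)[::-1]
    steps, digits, carry = [], [], 0
    for i in range(max(len(xs), len(ys))):
        x = int(xs[i]) if i < len(xs) else 0
        y = int(ys[i]) if i < len(ys) else 0
        total = x + y + carry
        digit, new_carry = total % 10, total // 10
        steps.append(f"{x}+{y}+{carry} -> digit {digit}, carry {new_carry}")
        digits.append(str(digit))
        carry = new_carry
    if carry:
        digits.append(str(carry))
    answer = "".join(reversed(digits))
    return f"{a}+{b}: " + "; ".join(steps) + f" => {answer}"

print(plain_format(17, 85))       # 17+85=102
print(reversed_format(17, 85))    # 17+85=201
print(scratchpad_format(17, 85))
# 17+85: 7+5+0 -> digit 2, carry 1; 1+8+1 -> digit 0, carry 1 => 102
```

Reversing the output aligns the token order with the order in which carries propagate, which is one intuition for why such a small formatting change could produce the sharp accuracy improvements the abstract describes.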