

Goat: Fine-tuned LLaMA Outperforms GPT-4 on Arithmetic Tasks

May 23, 2023
Authors: Tiedong Liu, Bryan Kian Hsiang Low
cs.AI

Abstract

We introduce Goat, a fine-tuned LLaMA model that significantly outperforms GPT-4 on a range of arithmetic tasks. Fine-tuned on a synthetically generated dataset, Goat achieves state-of-the-art performance on BIG-bench arithmetic sub-task. In particular, the zero-shot Goat-7B matches or even surpasses the accuracy achieved by the few-shot PaLM-540B. Surprisingly, Goat can achieve near-perfect accuracy on large-number addition and subtraction through supervised fine-tuning only, which is almost impossible with previous pretrained language models, such as Bloom, OPT, GPT-NeoX, etc. We attribute Goat's exceptional performance to LLaMA's consistent tokenization of numbers. To tackle more challenging tasks like large-number multiplication and division, we propose an approach that classifies tasks based on their learnability, and subsequently decomposes unlearnable tasks, such as multi-digit multiplication and division, into a series of learnable tasks by leveraging basic arithmetic principles. We thoroughly examine the performance of our model, offering a comprehensive evaluation of the effectiveness of our proposed decomposition steps. Additionally, Goat-7B can be easily trained using LoRA on a 24GB VRAM GPU, facilitating reproducibility for other researchers. We release our model, dataset, and the Python script for dataset generation.
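The abstract's key technique is to decompose tasks that are not directly learnable (multi-digit multiplication and division) into chains of simpler, learnable steps, and to train on synthetically generated data. The sketch below is an illustrative reconstruction of that idea, not the authors' released script: the function names, prompt format, and the particular place-value decomposition are assumptions made for the example.

```python
# Illustrative sketch (not the authors' released generator): build synthetic
# arithmetic examples, decomposing multi-digit multiplication into partial
# products by place value, in the spirit of the paper's approach.
import random


def addition_example(max_digits=16):
    """Directly learnable task: large-number addition answered in one step."""
    a = random.randint(0, 10 ** max_digits)
    b = random.randint(0, 10 ** max_digits)
    return {"prompt": f"{a} + {b} =", "target": str(a + b)}


def multiplication_example(max_digits=4):
    """Multi-digit multiplication decomposed into a chain of simpler steps."""
    a = random.randint(10 ** (max_digits - 1), 10 ** max_digits - 1)
    b = random.randint(10 ** (max_digits - 1), 10 ** max_digits - 1)

    # Split b by place value, e.g. 5678 -> 5000 + 600 + 70 + 8.
    parts = [int(d) * 10 ** i
             for i, d in enumerate(reversed(str(b))) if d != "0"]
    parts.reverse()

    # One partial product per part, then a running sum to the final answer.
    steps = [f"{a} * {b} = {a} * ({' + '.join(str(p) for p in parts)})"]
    partials = [a * p for p in parts]
    steps.append(" + ".join(f"{a} * {p}" for p in parts)
                 + " = " + " + ".join(str(pp) for pp in partials))
    total = partials[0]
    for pp in partials[1:]:
        steps.append(f"{total} + {pp} = {total + pp}")
        total += pp

    return {"prompt": f"{a} * {b} =", "target": "\n".join(steps)}


if __name__ == "__main__":
    random.seed(0)
    print(addition_example())
    print(multiplication_example())
```

The design point this illustrates is that addition and subtraction are supervised directly as single-step targets, while multiplication is supervised as an explicit chain of intermediate results so that each individual step stays within what the model can learn reliably.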