To Code, or Not To Code? Exploring Impact of Code in Pre-training

August 20, 2024
Authors: Viraat Aryabumi, Yixuan Su, Raymond Ma, Adrien Morisot, Ivan Zhang, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, Sara Hooker
cs.AI

Abstract

Including code in the pre-training data mixture, even for models not specifically designed for code, has become a common practice in LLM pre-training. While there is anecdotal consensus among practitioners that code data plays a vital role in general LLM performance, only limited work has analyzed the precise impact of code on non-code tasks. In this work, we systematically investigate the impact of code data on general performance. We ask: "what is the impact of code data used in pre-training on a large variety of downstream tasks beyond code generation?" We conduct extensive ablations and evaluate across a broad range of natural language reasoning tasks, world knowledge tasks, code benchmarks, and LLM-as-a-judge win-rates for models ranging in size from 470M to 2.8B parameters. Across settings, we find a consistent result: code is a critical building block for generalization far beyond coding tasks, and improvements to code quality have an outsized impact across all tasks. In particular, compared to text-only pre-training, the addition of code yields relative increases of up to 8.2% in natural language (NL) reasoning and 4.2% in world knowledge, a 6.6% improvement in generative win-rates, and a 12x boost in code performance. Our work suggests that investing in code quality and preserving code during pre-training have positive impacts.
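
The gains above are relative improvements over a text-only baseline, presumably computed as (score_with_code − score_text_only) / score_text_only. To make the ablation setup concrete, below is a minimal sketch, in Python, of how a pre-training mixture with a controlled code fraction might be assembled; the function name, the 25% default, and the document-list corpus format are illustrative assumptions, not the authors' actual pipeline.

import random

# Minimal sketch (illustrative, not the paper's pipeline): sample a
# pre-training mixture in which `code_fraction` of documents are code.
def build_mixture(text_docs, code_docs, code_fraction=0.25,
                  total_docs=100_000, seed=0):
    rng = random.Random(seed)
    n_code = int(total_docs * code_fraction)  # documents drawn from the code corpus
    n_text = total_docs - n_code              # remainder drawn from the text corpus
    mixture = rng.sample(code_docs, n_code) + rng.sample(text_docs, n_text)
    rng.shuffle(mixture)                      # interleave code and text documents
    return mixture

# An ablation pair then compares a text-only baseline against a code-augmented run:
#   baseline  = build_mixture(text_docs, code_docs, code_fraction=0.0)
#   with_code = build_mixture(text_docs, code_docs, code_fraction=0.25)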
