To Code, or Not To Code? Exploring Impact of Code in Pre-training

Abstract

Including code in the pre-training data mixture, even for models not specifically designed for code, has become a common practice in LLM pre-training. While there has been anecdotal consensus among practitioners that code data plays a vital role in the performance of general LLMs, only limited work has analyzed the precise impact of code on non-code tasks. In this work, we systematically investigate the impact of code data on general performance. We ask: "What is the impact of code data used in pre-training on a large variety of downstream tasks beyond code generation?" We conduct extensive ablations and evaluate across a broad range of natural language reasoning tasks, world knowledge tasks, code benchmarks, and LLM-as-a-judge win-rates for models ranging in size from 470M to 2.8B parameters. Across settings, we find consistent results: code is a critical building block for generalization far beyond coding tasks, and improvements to code quality have an outsized impact across all tasks. In particular, compared to text-only pre-training, the addition of code results in a relative increase of up to 8.2% in natural language (NL) reasoning, 4.2% in world knowledge, a 6.6% improvement in generative win-rates, and a 12x boost in code performance. Our work suggests that investing in code quality and preserving code during pre-training both have positive impacts.
