To Code, or Not To Code? Exploring Impact of Code in Pre-training

Abstract

Including code in the pre-training data mixture, even for models not specifically designed for code, has become common practice in LLM pre-training. While there is anecdotal consensus among practitioners that code data plays a vital role in the performance of general-purpose LLMs, only limited work has analyzed the precise impact of code on non-code tasks. In this work, we systematically investigate the impact of code data on general performance. We ask: "What is the impact of code data used in pre-training on a large variety of downstream tasks beyond code generation?" We conduct extensive ablations and evaluate across a broad range of natural language reasoning tasks, world knowledge tasks, code benchmarks, and LLM-as-a-judge win-rates for models ranging from 470M to 2.8B parameters. Across settings, we consistently find that code is a critical building block for generalization far beyond coding tasks, and that improvements to code quality have an outsized impact across all tasks. In particular, compared to text-only pre-training, adding code yields relative increases of up to 8.2% in natural language (NL) reasoning, 4.2% in world knowledge, and 6.6% in generative win-rates, along with a 12x boost in code performance. Our work suggests that investing in code quality and preserving code during pre-training both have positive impacts.
