コードのスケーリング則：あらゆるプログラミング言語が重要である

要旨

コード大規模言語モデル（Code LLM）は強力であるが、学習コストが高く、スケーリング則ではモデルサイズ、データ量、計算量から性能が予測される。しかし、異なるプログラミング言語（PL）は事前学習において様々な影響を与え、ベースモデルの性能を大きく左右するため、性能予測が不正確になる。さらに、既存研究は言語非依存的な設定に焦点を当てており、現代のソフトウェア開発において本質的に多言語化が進んでいる状況を無視している。したがって、まず各PLのスケーリング則を調査し、その後それらの相互影響を考慮して最終的な多言語スケーリング則を導出する必要がある。本論文では、複数のPL、モデルサイズ（0.2B～14Bパラメータ）、データセットサイズ（1Tトークン）にわたる1,000件以上の実験（H800時間換算で336,000時間以上に相当）を通じて、多言語コード事前学習におけるスケーリング則の体系的な初の探求を行う。我々は複数のPLにわたるコードLLMの包括的なスケーリング則を確立し、インタプリタ言語（Pythonなど）はコンパイル言語（Rustなど）に比べて、モデルサイズとデータ量の増加による恩恵が大きいことを明らかにした。本研究は、特に構文的に類似したPL間において、多言語事前学習が相乗効果をもたらすことを実証している。さらに、並列ペアリング（コード片とその翻訳を連結する）という事前学習戦略が、良好なスケーリング特性を持ちながら言語横断的能力を大幅に向上させる。最後に、比例依存型多言語スケーリング則を提案し、高効率なPL（Pythonなど）を優先し、高相乗効果のペア（JavaScript-TypeScriptなど）のバランスを調整し、飽和の早い言語（Rust）への割り当てを減らすことで、同じ計算予算の下で一様分布と比較して全てのPLにわたる平均性能を優位に高める訓練トークンの最適配分を実現する。

English

Code large language models (Code LLMs) are powerful but costly to train, with scaling laws predicting performance from model size, data, and compute. However, different programming languages (PLs) have varying impacts during pre-training that significantly affect base model performance, leading to inaccurate performance prediction. Besides, existing works focus on language-agnostic settings, neglecting the inherently multilingual nature of modern software development. Therefore, it is first necessary to investigate the scaling laws of different PLs, and then consider their mutual influences to arrive at the final multilingual scaling law. In this paper, we present the first systematic exploration of scaling laws for multilingual code pre-training, conducting over 1000+ experiments (Equivalent to 336,000+ H800 hours) across multiple PLs, model sizes (0.2B to 14B parameters), and dataset sizes (1T tokens). We establish comprehensive scaling laws for code LLMs across multiple PLs, revealing that interpreted languages (e.g., Python) benefit more from increased model size and data than compiled languages (e.g., Rust). The study demonstrates that multilingual pre-training provides synergistic benefits, particularly between syntactically similar PLs. Further, the pre-training strategy of the parallel pairing (concatenating code snippets with their translations) significantly enhances cross-lingual abilities with favorable scaling properties. Finally, a proportion-dependent multilingual scaling law is proposed to optimally allocate training tokens by prioritizing high-utility PLs (e.g., Python), balancing high-synergy pairs (e.g., JavaScript-TypeScript), and reducing allocation to fast-saturating languages (Rust), achieving superior average performance across all PLs compared to uniform distribution under the same compute budget.

コードのスケーリング則：あらゆるプログラミング言語が重要である

Scaling Laws for Code: Every Programming Language Matters

要旨

Support