Scaling Laws for Code: Every Programming Language Matters
December 15, 2025
Authors: Jian Yang, Shawn Guo, Lin Jing, Wei Zhang, Aishan Liu, Chuan Hao, Zhoujun Li, Wayne Xin Zhao, Xianglong Liu, Weifeng Lv, Bryan Dai
cs.AI
Abstract
Code large language models (Code LLMs) are powerful but costly to train, and existing scaling laws predict performance from model size, data, and compute. However, different programming languages (PLs) have varying impacts during pre-training that significantly affect base model performance, leading to inaccurate performance predictions. Moreover, existing work focuses on language-agnostic settings, neglecting the inherently multilingual nature of modern software development. It is therefore necessary to first investigate the scaling laws of individual PLs and then account for their mutual influences to arrive at a final multilingual scaling law. In this paper, we present the first systematic exploration of scaling laws for multilingual code pre-training, conducting over 1,000 experiments (equivalent to 336,000+ H800 GPU hours) across multiple PLs, model sizes (0.2B to 14B parameters), and dataset sizes (1T tokens). We establish comprehensive scaling laws for code LLMs across multiple PLs, revealing that interpreted languages (e.g., Python) benefit more from increased model size and data than compiled languages (e.g., Rust). The study demonstrates that multilingual pre-training provides synergistic benefits, particularly between syntactically similar PLs. Furthermore, a parallel-pairing pre-training strategy (concatenating code snippets with their translations) significantly enhances cross-lingual abilities and exhibits favorable scaling properties. Finally, we propose a proportion-dependent multilingual scaling law that allocates training tokens by prioritizing high-utility PLs (e.g., Python), balancing high-synergy pairs (e.g., JavaScript-TypeScript), and reducing the allocation to fast-saturating languages (e.g., Rust), achieving superior average performance across all PLs compared to a uniform distribution under the same compute budget.
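To make the allocation idea concrete, the sketch below illustrates what a proportion-dependent token split could look like. It is a minimal toy example, not the paper's fitted law: it assumes a Chinchilla-style per-language power law L(N, D) = E + A/N^alpha + B/D^beta, uses made-up coefficients for three languages, and searches for the token split that minimizes average predicted loss under a fixed budget. The coefficient values, the softmax parametrization, and the Nelder-Mead search are all assumptions for illustration only.

```python
# A minimal sketch, assuming a Chinchilla-style per-language power law; the
# coefficients below are hypothetical placeholders, not values from the paper.
import numpy as np
from scipy.optimize import minimize

# Hypothetical per-language coefficients for L(N, D) = E + A/N^alpha + B/D^beta.
# A "fast-saturating" language (Rust in the abstract) is modeled with a larger
# data exponent beta, so its data term flattens early and extra tokens help little.
COEFFS = {
    "python":     dict(E=0.90, A=400.0, alpha=0.32, B=600.0, beta=0.28),
    "javascript": dict(E=0.90, A=400.0, alpha=0.30, B=600.0, beta=0.30),
    "rust":       dict(E=1.00, A=400.0, alpha=0.28, B=600.0, beta=0.45),
}

N = 7e9        # model size in parameters (held fixed in this sketch)
BUDGET = 1e12  # total training tokens to split across the languages


def predicted_loss(lang: str, tokens: float) -> float:
    """Power-law loss for one language given its share of the token budget."""
    c = COEFFS[lang]
    return c["E"] + c["A"] / N ** c["alpha"] + c["B"] / max(tokens, 1.0) ** c["beta"]


def avg_loss(logits: np.ndarray) -> float:
    """Average predicted loss; a softmax keeps the proportions on the simplex."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return float(np.mean([predicted_loss(lang, share * BUDGET)
                          for lang, share in zip(COEFFS, p)]))


# Uniform split as the baseline, then search for a better proportion.
uniform = np.zeros(len(COEFFS))
result = minimize(avg_loss, x0=uniform, method="Nelder-Mead")
p_opt = np.exp(result.x - result.x.max())
p_opt /= p_opt.sum()

print("optimal proportions:", {l: round(float(p), 3) for l, p in zip(COEFFS, p_opt)})
print("uniform avg loss:", round(avg_loss(uniform), 4))
print("optimal avg loss:", round(avg_loss(result.x), 4))
```

In this toy setting the search shifts tokens away from the fast-saturating language toward Python, mirroring the allocation behavior the abstract describes; the paper's actual law additionally accounts for cross-language synergy (e.g., JavaScript-TypeScript), which this sketch omits.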