
Scaling Laws for Code: Every Programming Language Matters

December 15, 2025
Authors: Jian Yang, Shawn Guo, Lin Jing, Wei Zhang, Aishan Liu, Chuan Hao, Zhoujun Li, Wayne Xin Zhao, Xianglong Liu, Weifeng Lv, Bryan Dai
cs.AI

Abstract

Code large language models (Code LLMs) are powerful but costly to train, with scaling laws predicting performance from model size, data, and compute. However, different programming languages (PLs) have varying impacts during pre-training that significantly affect base model performance, leading to inaccurate performance predictions. Moreover, existing work focuses on language-agnostic settings, neglecting the inherently multilingual nature of modern software development. It is therefore necessary first to investigate the scaling laws of individual PLs, and then to account for their mutual influences in a final multilingual scaling law. In this paper, we present the first systematic exploration of scaling laws for multilingual code pre-training, conducting more than 1,000 experiments (equivalent to 336,000+ H800 GPU hours) across multiple PLs, model sizes (0.2B to 14B parameters), and dataset sizes (1T tokens). We establish comprehensive scaling laws for code LLMs across multiple PLs, revealing that interpreted languages (e.g., Python) benefit more from increased model size and data than compiled languages (e.g., Rust). The study demonstrates that multilingual pre-training provides synergistic benefits, particularly between syntactically similar PLs. Further, a parallel-pairing pre-training strategy (concatenating code snippets with their translations) significantly enhances cross-lingual abilities and exhibits favorable scaling properties. Finally, a proportion-dependent multilingual scaling law is proposed to optimally allocate training tokens by prioritizing high-utility PLs (e.g., Python), balancing high-synergy pairs (e.g., JavaScript-TypeScript), and reducing the allocation to fast-saturating languages (e.g., Rust), achieving superior average performance across all PLs compared to a uniform distribution under the same compute budget.
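As a reference point for the single-language setting, scaling-law fits of the kind the abstract describes are conventionally written in a Chinchilla-style power-law form. The sketch below uses generic placeholder symbols; the paper's exact per-language and proportion-dependent parameterizations are not given in this abstract.

```latex
% Generic Chinchilla-style scaling form for one programming language \ell
% (illustrative only; not the paper's exact parameterization):
%   N      = number of model parameters
%   D_\ell = number of training tokens in language \ell
L_\ell(N, D_\ell) = E_\ell + \frac{A_\ell}{N^{\alpha_\ell}} + \frac{B_\ell}{D_\ell^{\beta_\ell}}
```

Read this way, a fast-saturating language (the abstract names Rust) would be one whose returns from additional tokens D_\ell diminish earlier, which is why the proposed allocation shifts its budget toward higher-utility or higher-synergy languages.

The parallel-pairing strategy is described only at a high level (concatenating code snippets with their translations). The following is a minimal sketch of such a data-construction step; the language tags and separator format are illustrative assumptions, not the paper's actual scheme.

```python
# Minimal sketch of a parallel-pairing construction: a code snippet is
# concatenated with its translation into another language to form a single
# pre-training sequence. The tag strings below are hypothetical.

def build_parallel_pair(src_lang: str, src_code: str,
                        tgt_lang: str, tgt_code: str) -> str:
    """Concatenate a snippet with its translation into one training example."""
    return (
        f"<|lang:{src_lang}|>\n{src_code.rstrip()}\n"
        f"<|lang:{tgt_lang}|>\n{tgt_code.rstrip()}\n"
    )


if __name__ == "__main__":
    py_snippet = "def add(a, b):\n    return a + b"
    rs_snippet = "fn add(a: i32, b: i32) -> i32 {\n    a + b\n}"
    print(build_parallel_pair("python", py_snippet, "rust", rs_snippet))
```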