Scaling Laws for Code: Every Programming Language Matters

Abstract

Code large language models (Code LLMs) are powerful but costly to train, and scaling laws predict their performance from model size, data, and compute. However, different programming languages (PLs) have different effects during pre-training that significantly change base model performance, which makes performance prediction inaccurate. Moreover, existing work focuses on language-agnostic settings and neglects the inherently multilingual nature of modern software development. It is therefore necessary first to investigate the scaling laws of individual PLs, and then to account for their mutual influence to arrive at a final multilingual scaling law. In this paper, we present the first systematic exploration of scaling laws for multilingual code pre-training, conducting more than 1,000 experiments (equivalent to more than 336,000 H800 hours) across multiple PLs, model sizes (0.2B to 14B parameters), and dataset sizes (1T tokens). We establish comprehensive scaling laws for Code LLMs across multiple PLs and find that interpreted languages (e.g., Python) benefit more from increased model size and data than compiled languages (e.g., Rust). We show that multilingual pre-training provides synergistic benefits, particularly between syntactically similar PLs. Further, the parallel pairing pre-training strategy (concatenating code snippets with their translations) significantly enhances cross-lingual abilities and exhibits favorable scaling properties. Finally, we propose a proportion-dependent multilingual scaling law that optimally allocates training tokens by prioritizing high-utility PLs (e.g., Python), balancing high-synergy pairs (e.g., JavaScript-TypeScript), and reducing the allocation to fast-saturating languages (Rust), achieving better average performance across all PLs than a uniform distribution under the same compute budget.
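
The parallel pairing strategy mentioned in the abstract (concatenating a code snippet with its translation) can be illustrated with a minimal sketch. The abstract does not specify the document format, so the `ParallelPair` container, the `build_parallel_document` helper, the language-tag comment lines, and the blank-line separator are all hypothetical choices made here for illustration, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelPair:
    """A snippet and its translation (hypothetical container, not from the paper)."""
    source_lang: str
    source_code: str
    target_lang: str
    target_code: str


def build_parallel_document(pair: ParallelPair, separator: str = "\n\n") -> str:
    """Concatenate a snippet with its translation into one training document.

    The language-tag line and the separator are assumptions; the abstract
    only states that the two pieces are concatenated.
    """
    parts = [
        f"# language: {pair.source_lang}\n{pair.source_code.strip()}",
        f"# language: {pair.target_lang}\n{pair.target_code.strip()}",
    ]
    return separator.join(parts)


# Toy example: the same function in Python and JavaScript.
pair = ParallelPair(
    source_lang="Python",
    source_code="def add(a, b):\n    return a + b\n",
    target_lang="JavaScript",
    target_code="function add(a, b) {\n  return a + b;\n}\n",
)
document = build_parallel_document(pair)
print(document)
```

Placing both versions in a single document means the model sees the translation inside the same context window as the original, which is presumably what allows the pairing to improve cross-lingual ability.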
