OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models

Abstract

Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks, and agent systems. While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs suitable for rigorous scientific investigation, particularly those with reproducible data processing pipelines and transparent training protocols, remain limited. This scarcity is due to various challenges, including resource constraints, ethical considerations, and the competitive advantages of keeping models advanced. To address the gap, we introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community. Unlike most prior efforts, we release not only model weights and inference code, but also the reproducible training data, the complete data processing pipeline, rigorous experimental ablation results, and detailed training protocols for open scientific research. Through this comprehensive release, we identify the key ingredients for building a top-tier code LLM: (1) code-optimized heuristic rules for data cleaning and methods for data deduplication, (2) recall of text corpora related to code, and (3) high-quality synthetic data in both the annealing and supervised fine-tuning stages. By offering this level of openness, we aim to broaden access to all aspects of a top-tier code LLM, with OpenCoder serving as both a powerful model and an open foundation to accelerate research and enable reproducible advancements in code AI.
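
To make ingredient (1) concrete, the following is a minimal sketch of heuristic cleaning combined with exact deduplication. It is an illustration only, not the OpenCoder pipeline: the function names (`passes_heuristics`, `exact_dedup`, `clean_corpus`), the two filter rules, and the thresholds are all assumptions chosen for the example, and the abstract does not specify which rules or deduplication method the authors use.

```python
import hashlib


def passes_heuristics(source: str,
                      max_line_len: int = 1000,
                      min_alpha_ratio: float = 0.25) -> bool:
    """Hypothetical quality filter; rules and thresholds are illustrative only."""
    if not source.strip():
        return False  # drop empty or whitespace-only files
    lines = source.splitlines()
    if max(len(line) for line in lines) > max_line_len:
        return False  # very long lines often indicate minified or generated data
    alpha = sum(ch.isalpha() for ch in source)
    # require a minimum share of alphabetic characters
    return alpha / len(source) >= min_alpha_ratio


def exact_dedup(files):
    """Keep the first occurrence of each file, compared by SHA-256 of its text."""
    seen = set()
    unique = []
    for text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def clean_corpus(files):
    """Deduplicate first, then apply the heuristic filter."""
    return [text for text in exact_dedup(files) if passes_heuristics(text)]
```

Deduplicating before filtering means each distinct file is checked only once. A real pipeline would likely add near-duplicate detection and language-specific rules, which this sketch does not attempt.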
