KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding

Abstract

We introduce KodCode, a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data across diverse difficulties and domains for training Large Language Models for coding. Existing code-focused resources typically fail to ensure either breadth of coverage (e.g., spanning simple coding tasks to advanced algorithmic problems) or verifiable correctness (e.g., unit tests). In contrast, KodCode comprises question-solution-test triplets that are systematically validated via a self-verification procedure. Our pipeline first synthesizes a broad range of coding questions, then generates solutions and test cases, allocating additional attempts to challenging problems. Finally, post-training data is synthesized by rewriting questions into diverse formats and generating responses from a reasoning model (DeepSeek R1) under a test-based rejection sampling procedure. This pipeline yields a large-scale, robust, and diverse coding dataset. KodCode is suitable for supervised fine-tuning, and the paired unit tests also offer great potential for RL tuning. Fine-tuning experiments on coding benchmarks (HumanEval(+), MBPP(+), BigCodeBench, and LiveCodeBench) demonstrate that KodCode-tuned models achieve state-of-the-art performance, surpassing models such as Qwen2.5-Coder-32B-Instruct and DeepSeek-R1-Distill-Llama-70B.
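The self-verification step described in the abstract can be pictured with a small sketch. The abstract does not say how verification is implemented, so everything here is an assumption made for illustration: the names `passes_tests` and `self_verify`, the use of bare `assert` statements as unit tests, and the idea that a harder question simply receives a larger `max_attempts` budget. A real pipeline would also run candidate code in a sandbox rather than calling `exec` directly.

```python
from itertools import islice
from typing import Iterable, Optional, Tuple

# Hypothetical representation: (solution source, unit-test source).
Candidate = Tuple[str, str]


def passes_tests(solution_src: str, test_src: str) -> bool:
    """Execute the solution and then its tests in one fresh namespace.

    Returns True only if neither raises (a failed assert counts as a raise).
    NOTE: exec() on generated code is unsafe outside a sandbox; it is used
    here only to keep the sketch self-contained.
    """
    namespace: dict = {}
    try:
        exec(solution_src, namespace)
        exec(test_src, namespace)
    except Exception:
        return False
    return True


def self_verify(candidates: Iterable[Candidate],
                max_attempts: int) -> Optional[Candidate]:
    """Return the first candidate pair that passes its own tests.

    At most `max_attempts` candidates are tried. A larger budget stands in
    for the "additional attempts" given to challenging problems (assumed).
    Returns None if no attempt within the budget passes.
    """
    for solution_src, test_src in islice(candidates, max_attempts):
        if passes_tests(solution_src, test_src):
            return solution_src, test_src
    return None


# Two generated attempts for the same question: the first one is buggy.
attempts = [
    ("def add(a, b):\n    return a - b\n", "assert add(2, 3) == 5\n"),
    ("def add(a, b):\n    return a + b\n", "assert add(2, 3) == 5\n"),
]

kept = self_verify(attempts, max_attempts=2)
```

With a budget of one attempt the question would be dropped, while a budget of two keeps the second pair. The same pass/fail check could serve as the filter in test-based rejection sampling, where responses from a reasoning model are kept only if they pass the paired tests.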
