KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding

Abstract

We introduce KodCode, a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data across diverse difficulties and domains for training large language models (LLMs) for coding. Existing code-focused resources typically fail to ensure either breadth of coverage (e.g., spanning simple coding tasks to advanced algorithmic problems) or verifiable correctness (e.g., unit tests). In contrast, KodCode comprises question-solution-test triplets that are systematically validated via a self-verification procedure. Our pipeline begins by synthesizing a broad range of coding questions, then generates solutions and test cases, allocating additional attempts to challenging problems. Finally, post-training data is synthesized by rewriting questions into diverse formats and generating responses from a reasoning model (DeepSeek R1) under a test-based rejection sampling procedure. This pipeline yields a large-scale, robust, and diverse coding dataset. KodCode is suitable for supervised fine-tuning, and the paired unit tests also offer considerable potential for RL tuning. Fine-tuning experiments on coding benchmarks (HumanEval(+), MBPP(+), BigCodeBench, and LiveCodeBench) demonstrate that KodCode-tuned models achieve state-of-the-art performance, surpassing models such as Qwen2.5-Coder-32B-Instruct and DeepSeek-R1-Distill-Llama-70B.
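The abstract does not spell out how the self-verification step is implemented, so the following is only a minimal sketch of the general idea: a candidate solution is kept only if it passes its paired unit tests, and harder questions receive a larger attempt budget. All names here (`passes_tests`, `self_verify`, `max_attempts`) are hypothetical and are not taken from the KodCode pipeline; the candidate list stands in for repeated LLM generations.

```python
from typing import Iterable, Optional, Tuple

# A candidate is a (solution source, test source) pair, both as Python text.
Candidate = Tuple[str, str]


def passes_tests(solution_src: str, test_src: str) -> bool:
    """Return True if every test_* function in test_src passes.

    Assumption: tests are plain functions that raise on failure.
    exec() on untrusted code is unsafe; a real pipeline would use a sandbox.
    """
    namespace: dict = {}
    try:
        exec(solution_src, namespace)  # define the solution
        exec(test_src, namespace)      # define the tests alongside it
        tests = [obj for name, obj in namespace.items()
                 if name.startswith("test_") and callable(obj)]
        if not tests:
            return False               # nothing to verify against
        for test in tests:
            test()
        return True
    except Exception:
        return False                   # any error or failed assert rejects


def self_verify(candidates: Iterable[Candidate],
                max_attempts: int) -> Optional[Candidate]:
    """Return the first candidate that passes its tests within the budget."""
    for attempt, (solution_src, test_src) in enumerate(candidates):
        if attempt >= max_attempts:
            break
        if passes_tests(solution_src, test_src):
            return solution_src, test_src
    return None


# Toy stand-ins for two successive generations for the same question.
TESTS = "def test_add():\n    assert add(2, 3) == 5\n"
BAD = ("def add(a, b):\n    return a - b\n", TESTS)
GOOD = ("def add(a, b):\n    return a + b\n", TESTS)

small_budget = self_verify([BAD, GOOD], max_attempts=1)  # rejected
large_budget = self_verify([BAD, GOOD], max_attempts=2)  # second try kept
```

In this toy run the question is dropped under a budget of one attempt and kept under a budget of two, which illustrates why allocating extra attempts to challenging problems can retain questions that would otherwise be filtered out.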
