KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding

Abstract

We introduce KodCode, a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data across diverse difficulties and domains for training Large Language Models for coding. Existing code-focused resources typically fail to ensure either breadth of coverage (e.g., spanning simple coding tasks to advanced algorithmic problems) or verifiable correctness (e.g., unit tests). In contrast, KodCode comprises question-solution-test triplets that are systematically validated via a self-verification procedure. Our pipeline begins by synthesizing a broad range of coding questions, then generates solutions and test cases, with additional attempts allocated to challenging problems. Finally, post-training data synthesis is done by rewriting questions into diverse formats and generating responses from a reasoning model (DeepSeek R1) under a test-based rejection sampling procedure. This pipeline yields a large-scale, robust, and diverse coding dataset. KodCode is suitable for supervised fine-tuning, and the paired unit tests also provide great potential for reinforcement learning (RL) tuning. Fine-tuning experiments on coding benchmarks (HumanEval(+), MBPP(+), BigCodeBench, and LiveCodeBench) demonstrate that KodCode-tuned models achieve state-of-the-art performance, surpassing models such as Qwen2.5-Coder-32B-Instruct and DeepSeek-R1-Distill-Llama-70B.
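The self-verification step with extra attempts for hard problems can be pictured with a small sketch. This is not the authors' implementation: the abstract does not say how verification is carried out, so the sketch assumes it means running the generated tests against the generated solution and keeping the triplet only when they pass. The names `Triplet`, `passes`, `self_verify`, and `stub_generate` are hypothetical, and the stub generator stands in for a model call.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Triplet:
    question: str
    solution: str  # Python source that defines the solution
    tests: str     # Python source made of bare assert statements


def passes(solution_src: str, test_src: str) -> bool:
    """Return True if the tests run against the solution without raising."""
    namespace: dict = {}
    try:
        # Assumption: unsandboxed exec is acceptable for illustration only.
        exec(solution_src, namespace)
        exec(test_src, namespace)
    except Exception:
        return False
    return True


def self_verify(
    question: str,
    generate: Callable[[str, int], Tuple[str, str]],
    max_attempts: int,
) -> Optional[Triplet]:
    """Keep the first (solution, tests) pair that is self-consistent.

    A larger max_attempts models the extra budget given to hard questions.
    """
    for attempt in range(max_attempts):
        solution, tests = generate(question, attempt)
        if passes(solution, tests):
            return Triplet(question, solution, tests)
    return None  # question is dropped when no attempt verifies


def stub_generate(question: str, attempt: int) -> Tuple[str, str]:
    """Hypothetical generator: the first attempt is buggy, later ones are correct."""
    tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
    if attempt == 0:
        return "def add(a, b):\n    return a - b", tests
    return "def add(a, b):\n    return a + b", tests


if __name__ == "__main__":
    easy_budget = self_verify("Write add(a, b).", stub_generate, max_attempts=1)
    hard_budget = self_verify("Write add(a, b).", stub_generate, max_attempts=3)
    print(easy_budget is None, hard_budget is not None)
```

With a budget of one attempt the question is discarded, while a budget of three recovers a verified triplet. The same pass/fail check could serve as the filter in test-based rejection sampling, where a model response is kept only if it passes the stored tests.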
