rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset

Abstract

Advancing code reasoning in large language models (LLMs) is fundamentally limited by the scarcity of high-difficulty datasets, especially those with the verifiable input-output test cases needed for rigorous solution validation at scale. We introduce rStar-Coder, which significantly improves LLM code reasoning capabilities by constructing a large-scale, verified dataset of 418K competition-level code problems and 580K long-reasoning solutions, along with rich test cases of varying difficulty. This is achieved through three core contributions: (1) we curate competitive programming code problems and oracle solutions to synthesize new, solvable problems; (2) we introduce a reliable input-output test case synthesis pipeline that decouples generation into a three-step input generation method and a mutual verification mechanism for effective output labeling; (3) we augment problems with high-quality, test-case-verified long-reasoning solutions. Extensive experiments on Qwen models (1.5B-14B) across various code reasoning benchmarks demonstrate the superiority of the rStar-Coder dataset, achieving leading performance comparable to frontier reasoning LLMs with much smaller model sizes. On LiveCodeBench, rStar-Coder improves Qwen2.5-7B from 17.4% to 57.3%, and Qwen2.5-14B from 23.3% to 62.5%, surpassing o3-mini (low) by 3.1%. On the more challenging USA Computing Olympiad, our 7B model achieves an average pass@1 accuracy of 16.15%, outperforming the frontier-level QWQ-32B. Code and the dataset will be released at https://github.com/microsoft/rStar.
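The abstract names a mutual verification mechanism for output labeling but does not spell out its details. One plausible reading is a majority vote: several independently produced candidate solutions are run on the same synthesized inputs, and an output is kept as the label only when enough of them agree. The sketch below illustrates that idea only. The function name `label_outputs`, the `min_agreement` threshold, and the representation of solutions as Python callables are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence

# Hypothetical representation: a candidate solution maps an input string
# to an output string. A real pipeline would sandbox and time-limit this.
Solution = Callable[[str], str]


def label_outputs(
    solutions: Sequence[Solution],
    inputs: Sequence[str],
    min_agreement: float = 0.6,  # assumed threshold, not taken from the paper
) -> List[Optional[str]]:
    """Return one label per input, or None when the candidates disagree."""
    labels: List[Optional[str]] = []
    for test_input in inputs:
        outputs = []
        for solve in solutions:
            try:
                outputs.append(solve(test_input))
            except Exception:
                # A crashing candidate casts no vote for this input.
                continue
        if not outputs:
            labels.append(None)
            continue
        # Most frequent output and the number of candidates that produced it.
        output, votes = Counter(outputs).most_common(1)[0]
        # Agreement is measured against all candidates, including crashed ones.
        agreed = votes / len(solutions) >= min_agreement
        labels.append(output if agreed else None)
    return labels


def add(s: str) -> str:
    """Toy candidate: sums whitespace-separated integers."""
    return str(sum(int(x) for x in s.split()))


def buggy(s: str) -> str:
    """Toy candidate with a bug: multiplies instead of adding."""
    a, b = (int(x) for x in s.split())
    return str(a * b)


if __name__ == "__main__":
    # Two of three candidates agree on "1 2", so the label "3" is accepted.
    print(label_outputs([add, add, buggy], ["1 2", "2 2"]))
```

In this toy setup an input whose candidates split evenly gets no label, which reflects the design intent of such a scheme: an unverifiable test case is dropped rather than stored with a guessed output.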
