
rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset

May 27, 2025
Authors: Yifei Liu, Li Lyna Zhang, Yi Zhu, Bingcheng Dong, Xudong Zhou, Ning Shang, Fan Yang, Mao Yang
cs.AI

Abstract

Advancing code reasoning in large language models (LLMs) is fundamentally limited by the scarcity of high-difficulty datasets, especially those with verifiable input-output test cases necessary for rigorous solution validation at scale. We introduce rStar-Coder, which significantly improves LLM code reasoning capabilities by constructing a large-scale, verified dataset of 418K competition-level code problems and 580K long-reasoning solutions, along with rich test cases of varying difficulty. This is achieved through three core contributions: (1) we curate competitive programming code problems and oracle solutions to synthesize new, solvable problems; (2) we introduce a reliable input-output test case synthesis pipeline that decouples generation into a three-step input generation method and a mutual verification mechanism for effective output labeling; (3) we augment problems with high-quality, test-case-verified long-reasoning solutions. Extensive experiments on Qwen models (1.5B-14B) across various code reasoning benchmarks demonstrate the superiority of the rStar-Coder dataset, achieving performance comparable to frontier reasoning LLMs at much smaller model sizes. On LiveCodeBench, rStar-Coder improves Qwen2.5-7B from 17.4% to an impressive 57.3%, and Qwen2.5-14B from 23.3% to 62.5%, surpassing o3-mini (low) by 3.1%. On the more challenging USA Computing Olympiad, our 7B model achieves an average pass@1 accuracy of 16.15%, outperforming the frontier-level QWQ-32B. Code and the dataset will be released at https://github.com/microsoft/rStar.
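The mutual-verification idea for output labeling can be sketched as follows: run several independently curated oracle solutions on each synthesized input, and keep an input-output pair only when enough solutions agree on the output. This is a minimal illustration of the concept, not the paper's implementation; the function name, the majority threshold, and the callable-based interface are assumptions.

```python
from collections import Counter

def mutual_verify(inputs, solutions, min_agree=2):
    """Label each synthesized input with an output only when at least
    `min_agree` candidate solutions produce the same result; inputs
    without sufficient agreement are discarded as unreliable."""
    labeled = []
    for x in inputs:
        outputs = [solve(x) for solve in solutions]
        # Take the most common output and how many solutions produced it.
        (best, count), = Counter(outputs).most_common(1)
        if count >= min_agree:
            labeled.append((x, best))
    return labeled

# Toy usage: two correct "solutions" and one buggy one for f(x) = 2x.
solutions = [lambda x: x * 2, lambda x: x + x, lambda x: x * 3]
print(mutual_verify([1, 2, 0], solutions))
```

In this toy run the buggy third solution is outvoted on inputs 1 and 2, and all three agree on input 0, so every input receives the correct label. In practice the threshold trades label precision against how many synthesized inputs survive.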
