

rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset

May 27, 2025
作者: Yifei Liu, Li Lyna Zhang, Yi Zhu, Bingcheng Dong, Xudong Zhou, Ning Shang, Fan Yang, Mao Yang
cs.AI

Abstract

Advancing code reasoning in large language models (LLMs) is fundamentally limited by the scarcity of high-difficulty datasets, especially those with verifiable input-output test cases necessary for rigorous solution validation at scale. We introduce rStar-Coder, which significantly improves LLM code reasoning capabilities by constructing a large-scale, verified dataset of 418K competition-level code problems and 580K long-reasoning solutions, along with rich test cases of varying difficulty. This is achieved through three core contributions: (1) we curate competitive programming code problems and oracle solutions to synthesize new, solvable problems; (2) we introduce a reliable input-output test case synthesis pipeline that decouples generation into a three-step input generation method and a mutual verification mechanism for effective output labeling; (3) we augment problems with high-quality, test-case-verified long-reasoning solutions. Extensive experiments on Qwen models (1.5B-14B) across various code reasoning benchmarks demonstrate the superiority of the rStar-Coder dataset, achieving leading performance comparable to frontier reasoning LLMs with much smaller model sizes. On LiveCodeBench, rStar-Coder improves Qwen2.5-7B from 17.4% to an impressive 57.3%, and Qwen2.5-14B from 23.3% to 62.5%, surpassing o3-mini (low) by 3.1%. On the more challenging USA Computing Olympiad, our 7B model achieves an average pass@1 accuracy of 16.15%, outperforming the frontier-level QWQ-32B. Code and the dataset will be released at https://github.com/microsoft/rStar.
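The abstract names a mutual verification mechanism for output labeling: since synthesized problems have no ground-truth outputs, an output is only accepted as a test-case label when independently generated solutions agree on it. The paper does not give the exact procedure, so the following is a minimal sketch of majority-vote labeling under assumed simplifications (the function `mutual_verify`, the `min_agree` threshold, and the toy candidate solutions are all hypothetical illustrations, not the authors' implementation):

```python
from collections import Counter

def mutual_verify(solutions, test_input, min_agree=2):
    """Label an output for test_input by cross-checking candidate solutions.

    solutions: list of callables, each a stand-in for an independently
    generated candidate program. Returns the majority output if at least
    min_agree candidates agree, otherwise None (input is discarded).
    """
    outputs = []
    for solve in solutions:
        try:
            outputs.append(solve(test_input))
        except Exception:
            continue  # a crashing candidate casts no vote
    if not outputs:
        return None
    output, votes = Counter(outputs).most_common(1)[0]
    return output if votes >= min_agree else None

# Toy example: three candidate "solutions" to the same summation problem.
candidates = [
    lambda n: n * (n + 1) // 2,   # correct closed form
    lambda n: sum(range(n + 1)),  # correct iterative form
    lambda n: n * n,              # buggy candidate
]
print(mutual_verify(candidates, 4))  # two of three agree -> 10
```

The idea is that independently written correct solutions converge on the same output, while bugs diverge, so agreement serves as a proxy for correctness without a human-written oracle.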

