rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset

Abstract

Advancing code reasoning in large language models (LLMs) is fundamentally limited by the scarcity of high-difficulty datasets, especially those with the verifiable input-output test cases needed for rigorous solution validation at scale. We introduce rStar-Coder, which significantly improves LLM code reasoning capabilities by constructing a large-scale, verified dataset of 418K competition-level code problems and 580K long-reasoning solutions, along with rich test cases of varying difficulty. This is achieved through three core contributions: (1) we curate competitive programming code problems and oracle solutions to synthesize new, solvable problems; (2) we introduce a reliable input-output test case synthesis pipeline that decouples generation into a three-step input generation method and a mutual verification mechanism for effective output labeling; (3) we augment problems with high-quality, test-case-verified long-reasoning solutions. Extensive experiments on Qwen models (1.5B-14B) across various code reasoning benchmarks demonstrate the superiority of the rStar-Coder dataset, achieving leading performance comparable to frontier reasoning LLMs with much smaller model sizes. On LiveCodeBench, rStar-Coder improves Qwen2.5-7B from 17.4% to 57.3% and Qwen2.5-14B from 23.3% to 62.5%, surpassing o3-mini (low) by 3.1%. On the more challenging USA Computing Olympiad, our 7B model achieves an average pass@1 accuracy of 16.15%, outperforming the frontier-level QWQ-32B. Code and the dataset will be released at https://github.com/microsoft/rStar.
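The abstract names a mutual verification mechanism for output labeling but does not spell it out. The following is a minimal sketch, assuming it means accepting the outputs on which a majority of independently sampled candidate solutions agree across all test inputs. The function name `label_outputs`, the `Solution` type, and the `min_agreement` threshold are hypothetical and are not taken from the paper or its repository.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence

# Hypothetical type: a candidate solution maps one test input to one output.
Solution = Callable[[str], str]


def label_outputs(
    candidates: Sequence[Solution],
    test_inputs: Sequence[str],
    min_agreement: float = 0.6,  # assumed threshold, not from the paper
) -> Optional[List[str]]:
    """Return one agreed output per input, or None if agreement is too weak."""
    # Run every candidate on every input; a crash is recorded as None.
    traces = []
    for solve in candidates:
        outputs = []
        for x in test_inputs:
            try:
                outputs.append(solve(x))
            except Exception:
                outputs.append(None)
        traces.append(tuple(outputs))

    # Only candidates that ran on all inputs may vote.
    valid = [t for t in traces if None not in t]
    if not valid:
        return None

    # Candidates "verify" each other when their full output traces match.
    trace, votes = Counter(valid).most_common(1)[0]
    if votes / len(candidates) < min_agreement:
        return None
    return list(trace)
```

In this sketch, agreement is measured over the whole output trace rather than per input, so a candidate that is wrong on a single input does not contribute to the accepted labels. If no trace reaches the threshold, the test inputs are left unlabeled instead of being given a guess.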
