OpenCodeReasoning-II: A Simple Test Time Scaling Approach via Self-Critique

Abstract

Recent advances in reasoning-based Large Language Models (LLMs), particularly their potential through test-time scaling, have created significant opportunities for distillation in code generation and critique. However, progress in both areas fundamentally depends on large-scale, high-quality datasets. In this work, we introduce OpenCodeReasoning-II, a dataset consisting of 2.5M question-solution-critique triples (approx. 35K unique programming questions), making it nearly twice the size of the previous largest publicly available code reasoning dataset. We employ a two-stage supervised fine-tuning strategy: the first stage focuses on fine-tuning for code generation, while the second stage jointly trains models for both code generation and critique. Our resulting fine-tuned Qwen2.5-Instruct models achieve code generation performance that exceeds or equals that of the best prior open-weight distilled models. Notably, integrating our code generation and critique models leads to significant improvements in competitive coding performance. Furthermore, we present an extension of the LiveCodeBench benchmark that specifically supports the C++ programming language, thereby facilitating more comprehensive LLM evaluation using this benchmark.
