OpenCodeReasoning-II: A Simple Test Time Scaling Approach via Self-Critique
July 11, 2025
Authors: Wasi Uddin Ahmad, Somshubra Majumdar, Aleksander Ficek, Sean Narenthiran, Mehrzad Samadi, Jocelyn Huang, Siddhartha Jain, Vahid Noroozi, Boris Ginsburg
cs.AI
Abstract
Recent advancements in reasoning-based Large Language Models (LLMs),
particularly their potential through test-time scaling, have created
significant opportunities for distillation in code generation and critique.
However, progress in both areas fundamentally depends on large-scale,
high-quality datasets. In this work, we introduce OpenCodeReasoning-II, a
dataset consisting of 2.5M question-solution-critique triples (approx. 35K unique
programming questions), making it nearly twice the size of the previous largest
publicly available code reasoning dataset. We employ a two-stage
supervised fine-tuning strategy: the first stage focuses on fine-tuning for
code generation, while the second stage involves the joint training of models
for both code generation and critique. Our resulting fine-tuned Qwen2.5-Instruct
models achieve code generation performance that matches or exceeds the
best prior open-weight distilled models. Notably, the integration of our code
generation and critique models leads to significant improvements in competitive
coding performance. Furthermore, we present an extension of the LiveCodeBench
benchmark to specifically support the C++ programming language, thereby
facilitating more comprehensive LLM evaluation using this benchmark.
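To make the test-time scaling idea concrete, below is a minimal sketch of a self-critique selection loop, assuming a setup in which a generation model proposes candidate solutions and a separate critique model accepts or rejects them. The function name `solve_with_self_critique`, the prompt templates, and the verdict parsing are illustrative assumptions, not the paper's actual procedure or prompts.

```python
# Minimal sketch of test-time scaling via self-critique (illustrative assumption,
# not the authors' exact procedure): sample candidate solutions from a generation
# model, ask a critique model for a verdict on each, and keep the first accepted one.
from typing import Callable, Optional


def solve_with_self_critique(
    problem: str,
    generate: Callable[[str], str],  # stands in for a call to the code-generation model
    critique: Callable[[str], str],  # stands in for a call to the critique model
    num_candidates: int = 8,
) -> Optional[str]:
    """Return the first candidate solution the critique model accepts, else None."""
    for _ in range(num_candidates):
        # Sample one candidate solution (hypothetical prompt template).
        candidate = generate(f"Solve the following programming problem:\n\n{problem}")

        # Ask the critique model whether the candidate is correct (hypothetical prompt).
        verdict = critique(
            f"Problem:\n{problem}\n\nCandidate solution:\n{candidate}\n\n"
            "Is this solution correct? Reply with 'correct' or 'incorrect'."
        )

        # Accept the candidate only on a positive verdict; otherwise resample.
        if verdict.strip().lower().startswith("correct"):
            return candidate
    return None
```

With a critique model trained on the question-solution-critique triples in OpenCodeReasoning-II, a loop of this kind spends additional inference-time compute to filter generations, which is the style of generation-plus-critique integration the abstract reports as improving competitive coding performance.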