OpenCodeReasoning-II: A Simple Test Time Scaling Approach via Self-Critique
July 11, 2025
Authors: Wasi Uddin Ahmad, Somshubra Majumdar, Aleksander Ficek, Sean Narenthiran, Mehrzad Samadi, Jocelyn Huang, Siddhartha Jain, Vahid Noroozi, Boris Ginsburg
cs.AI
Abstract
Recent advancements in reasoning-based Large Language Models (LLMs),
particularly their potential through test-time scaling, have created
significant opportunities for distillation in code generation and critique.
However, progress in both areas fundamentally depends on large-scale,
high-quality datasets. In this work, we introduce OpenCodeReasoning-II, a
dataset consisting of 2.5M question-solution-critique triples (approx. 35K unique
programming questions), making it nearly twice the size of the previous largest
publicly available code reasoning dataset. We employ a two-stage
supervised fine-tuning strategy: the first stage focuses on fine-tuning for
code generation, while the second stage involves the joint training of models
for both code generation and critique. Our resulting fine-tuned Qwen2.5-Instruct
models achieve code generation performance that matches or exceeds the
best prior open-weight distilled models. Notably, the integration of our code
generation and critique models leads to significant improvements in competitive
coding performance. Furthermore, we present an extension of the LiveCodeBench
benchmark to specifically support the C++ programming language, thereby
facilitating more comprehensive LLM evaluation using this benchmark.
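To make the test-time scaling idea concrete, below is a minimal sketch of a self-critique selection loop, assuming a setup in which a generation model proposes candidate solutions and a separate critique model accepts or rejects them. The function name `solve_with_self_critique`, the prompt templates, and the verdict parsing are illustrative assumptions, not the paper's actual procedure or prompts.

```python
# Minimal sketch of test-time scaling via self-critique (illustrative assumption,
# not the authors' exact procedure): sample candidate solutions from a generation
# model, ask a critique model for a verdict on each, and keep the first accepted one.
from typing import Callable, Optional


def solve_with_self_critique(
    problem: str,
    generate: Callable[[str], str],  # stands in for a call to the code-generation model
    critique: Callable[[str], str],  # stands in for a call to the critique model
    num_candidates: int = 8,
) -> Optional[str]:
    """Return the first candidate solution the critique model accepts, else None."""
    for _ in range(num_candidates):
        # Sample one candidate solution (hypothetical prompt template).
        candidate = generate(f"Solve the following programming problem:\n\n{problem}")

        # Ask the critique model whether the candidate is correct (hypothetical prompt).
        verdict = critique(
            f"Problem:\n{problem}\n\nCandidate solution:\n{candidate}\n\n"
            "Is this solution correct? Reply with 'correct' or 'incorrect'."
        )

        # Accept the candidate only on a positive verdict; otherwise resample.
        if verdict.strip().lower().startswith("correct"):
            return candidate
    return None
```

With a critique model trained on the question-solution-critique triples in OpenCodeReasoning-II, a loop of this kind spends additional inference-time compute to filter generations, which is the style of generation-plus-critique integration the abstract reports as improving competitive coding performance.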