OpenCodeReasoning-II: A Simple Test Time Scaling Approach via Self-Critique
July 11, 2025
作者: Wasi Uddin Ahmad, Somshubra Majumdar, Aleksander Ficek, Sean Narenthiran, Mehrzad Samadi, Jocelyn Huang, Siddhartha Jain, Vahid Noroozi, Boris Ginsburg
cs.AI
Abstract
Recent advancements in reasoning-based Large Language Models (LLMs),
particularly their potential through test-time scaling, have created
significant opportunities for distillation in code generation and critique.
However, progress in both areas fundamentally depends on large-scale,
high-quality datasets. In this work, we introduce OpenCodeReasoning-II, a
dataset consisting of 2.5M question-solution-critique triples (approx. 35K unique
programming questions), making it nearly twice the size of the previous largest
publicly available code reasoning dataset. We employ a two-stage
supervised fine-tuning strategy: the first stage focuses on fine-tuning for
code generation, while the second stage involves the joint training of models
for both code generation and critique. Our resulting fine-tuned Qwen2.5-Instruct
models achieve performance in code generation that either exceeds or equals the
best prior open-weight distilled models. Notably, the integration of our code
generation and critique models leads to significant improvements in competitive
coding performance. Furthermore, we present an extension of the LiveCodeBench
benchmark to specifically support the C++ programming language, thereby
facilitating more comprehensive LLM evaluation using this benchmark.
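
As a rough illustration of the test-time scaling idea described above, the minimal sketch below pairs a generation model with a critique model: the generator proposes a candidate solution, the critic reviews it, and rejected candidates trigger another attempt. The function names (generate_solution, critique_solution, solve_with_self_critique) and the retry budget are hypothetical placeholders for exposition, not the paper's actual interface or implementation.

# Hypothetical sketch of test-time scaling via self-critique (not the paper's code).
# generate_solution and critique_solution stand in for calls to the fine-tuned
# generation and critique models.

def generate_solution(question: str) -> str:
    """Placeholder: sample a candidate code solution from the generation model."""
    raise NotImplementedError

def critique_solution(question: str, solution: str) -> bool:
    """Placeholder: return True if the critique model judges the solution correct."""
    raise NotImplementedError

def solve_with_self_critique(question: str, max_attempts: int = 4) -> str:
    """Resample until the critique model accepts a candidate or the budget is spent."""
    candidate = ""
    for _ in range(max_attempts):
        candidate = generate_solution(question)
        if critique_solution(question, candidate):
            return candidate  # accepted by the critique model
    return candidate  # fall back to the last candidate if none was accepted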