OpenCodeReasoning-II: A Simple Test Time Scaling Approach via Self-Critique

July 11, 2025
作者: Wasi Uddin Ahmad, Somshubra Majumdar, Aleksander Ficek, Sean Narenthiran, Mehrzad Samadi, Jocelyn Huang, Siddhartha Jain, Vahid Noroozi, Boris Ginsburg
cs.AI

Abstract

Recent advancements in reasoning-based Large Language Models (LLMs), particularly their potential through test-time scaling, have created significant opportunities for distillation in code generation and critique. However, progress in both areas fundamentally depends on large-scale, high-quality datasets. In this work, we introduce OpenCodeReasoning-II, a dataset consisting of 2.5M question-solution-critique triples (approx. 35K unique programming questions), making it nearly twice the size of the previous largest publicly available code reasoning dataset. We employ a two-stage supervised fine-tuning strategy: the first stage focuses on fine-tuning for code generation, while the second stage involves the joint training of models for both code generation and critique. The resulting fine-tuned Qwen2.5-Instruct models achieve code-generation performance that exceeds or equals the best prior open-weight distilled models. Notably, integrating our code generation and critique models leads to significant improvements in competitive coding performance. Furthermore, we extend the LiveCodeBench benchmark to support the C++ programming language, thereby facilitating more comprehensive LLM evaluation using this benchmark.
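
To make the titular idea concrete, the following is a minimal sketch of test-time scaling via self-critique: a generation model proposes a solution, a critique model judges it, and generation is retried until the critic accepts or the attempt budget is exhausted. The helper names (generate_solution, critique_solution) are hypothetical stand-ins for the paper's fine-tuned generation and critique models, not its actual API.

# Minimal sketch of test-time scaling via self-critique.
# generate_solution / critique_solution are hypothetical stand-ins
# for the fine-tuned generation and critique models.

def generate_solution(question: str, attempt: int) -> str:
    """Stand-in for the code-generation model."""
    return f"// candidate {attempt} for: {question}"

def critique_solution(question: str, solution: str) -> bool:
    """Stand-in for the critique model: True means 'accept'."""
    return "candidate 2" in solution  # placeholder verdict

def solve_with_self_critique(question: str, max_attempts: int = 4) -> str:
    """Spend extra test-time compute by regenerating until the
    critique model accepts a candidate or the budget runs out."""
    candidate = ""
    for attempt in range(max_attempts):
        candidate = generate_solution(question, attempt)
        if critique_solution(question, candidate):
            break  # critic accepted this candidate
    return candidate

if __name__ == "__main__":
    print(solve_with_self_critique("sum of two numbers in C++"))

The key design point is that extra inference-time compute is spent on generate-critique rounds rather than on a single longer generation, which is how a critique model can lift competitive-coding performance without retraining the generator.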