OpenCodeReasoning-II: A Simple Test Time Scaling Approach via Self-Critique

Abstract

Recent advances in reasoning-based Large Language Models (LLMs), particularly their potential through test-time scaling, have created significant opportunities for distillation in code generation and critique. However, progress in both areas fundamentally depends on large-scale, high-quality datasets. In this work, we introduce OpenCodeReasoning-II, a dataset consisting of 2.5M question-solution-critique triples (approx. 35K unique programming questions), making it nearly twice the size of the previous largest publicly available code reasoning dataset. We employ a two-stage supervised fine-tuning strategy: the first stage fine-tunes for code generation, while the second stage jointly trains models for both code generation and critique. Our resulting fine-tuned Qwen2.5-Instruct models achieve code generation performance that exceeds or equals that of the best prior open-weight distilled models. Notably, integrating our code generation and critique models leads to significant improvements in competitive coding performance. Furthermore, we present an extension of the LiveCodeBench benchmark that specifically supports the C++ programming language, thereby facilitating more comprehensive LLM evaluation using this benchmark.

