Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning

Abstract

Reinforcement Learning (RL) has emerged as a popular training paradigm, particularly when paired with reasoning models. While effective, it primarily focuses on generating responses and lacks mechanisms that explicitly foster critique or reflection. Several recent studies, such as Critique-Fine-Tuning (CFT) and Critique-Guided-Distillation (CGD), have shown the benefits of explicitly teaching LLMs how to critique. Motivated by them, we propose Critique Reinforcement Learning (CRL), in which the model is tasked with generating a critique for a given (question, solution) pair. The reward is determined solely by whether the final judgment label c ∈ {True, False} of the generated critique matches the ground-truth judgment c^*. Building on this, we introduce Critique-Coder, which is trained on a hybrid of RL and CRL by substituting 20% of the standard RL data with CRL data. We fine-tune multiple models (Critique-Coder) and evaluate them on different benchmarks, where Critique-Coder consistently outperforms the RL-only baselines on every benchmark evaluated. Notably, our Critique-Coder-8B reaches over 60% on LiveCodeBench (v5), outperforming other reasoning models such as DeepCoder-14B and GPT-o1. Beyond code generation, Critique-Coder also demonstrates enhanced general reasoning abilities, as evidenced by its better performance on logic reasoning tasks from the BBEH dataset. This indicates that applying CRL to coding datasets enhances general reasoning and critique abilities, which transfer across a broad range of tasks. Hence, we believe that CRL is a strong complement to standard RL for LLM reasoning.
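
The two mechanisms described in the abstract, a binary reward on the critique's final judgment and a 20% substitution of RL data with CRL data, can be illustrated with a minimal sketch. This is not the authors' implementation. The "Conclusion: True/False" output format, the function names (extract_judgment, crl_reward, mix_datasets), and the random sampling strategy are all assumptions made for illustration; the abstract only states that the reward depends on whether c matches c^* and that 20% of the RL data is replaced.

```python
import random
import re
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")

# Assumed output convention (hypothetical): the critique ends with a line
# such as "Conclusion: True" or "Conclusion: False".
_LABEL_PATTERN = re.compile(r"Conclusion:\s*(True|False)\b")


def extract_judgment(critique: str) -> Optional[bool]:
    """Return the last judgment label found in the critique, or None."""
    matches = _LABEL_PATTERN.findall(critique)
    if not matches:
        return None
    return matches[-1] == "True"


def crl_reward(critique: str, ground_truth: bool) -> float:
    """Binary reward: 1.0 only if the predicted label c equals c^*.

    A critique with no parsable label gets 0.0 (an assumption).
    """
    predicted = extract_judgment(critique)
    return 1.0 if predicted is not None and predicted == ground_truth else 0.0


def mix_datasets(
    rl_data: Sequence[T],
    crl_data: Sequence[T],
    crl_fraction: float = 0.2,
    seed: int = 0,
) -> List[T]:
    """Replace a fraction of the RL examples with CRL examples.

    The total size stays equal to len(rl_data); which examples are
    dropped or added is chosen uniformly at random (an assumption).
    """
    rng = random.Random(seed)
    n_crl = round(len(rl_data) * crl_fraction)
    if n_crl > len(crl_data):
        raise ValueError("not enough CRL examples for the requested fraction")
    kept_rl = rng.sample(list(rl_data), len(rl_data) - n_crl)
    added_crl = rng.sample(list(crl_data), n_crl)
    mixed = kept_rl + added_crl
    rng.shuffle(mixed)
    return mixed
```

Because the reward looks only at the final label, the text of the critique itself is not scored directly in this sketch; any reasoning quality has to be learned indirectly through getting the judgment right.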
