

Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning

September 26, 2025
Authors: Chi Ruan, Dongfu Jiang, Yubo Wang, Wenhu Chen
cs.AI

Abstract

Reinforcement Learning (RL) has emerged as a popular training paradigm, particularly when paired with reasoning models. While effective, it primarily focuses on generating responses and lacks mechanisms to explicitly foster critique or reflection. Several recent studies, such as Critique-Fine-Tuning (CFT) and Critique-Guided-Distillation (CGD), have shown the benefits of explicitly teaching LLMs how to critique. Motivated by these, we propose Critique Reinforcement Learning (CRL), where the model is tasked with generating a critique for a given (question, solution) pair. The reward is determined solely by whether the final judgment label c in {True, False} of the generated critique aligns with the ground-truth judgment c^*. Building on this, we introduce Critique-Coder, which is trained on a hybrid of RL and CRL by substituting 20% of the standard RL data with CRL data. We fine-tune multiple models (Critique-Coder) and evaluate them on different benchmarks to show their advantages over RL-only models. We show that Critique-Coder consistently outperforms RL-only baselines on all the evaluated benchmarks. Notably, our Critique-Coder-8B can reach over 60% on LiveCodeBench (v5), outperforming other reasoning models like DeepCoder-14B and GPT-o1. Beyond code generation, Critique-Coder also demonstrates enhanced general reasoning abilities, as evidenced by its better performance on logic reasoning tasks from the BBEH dataset. This indicates that applying CRL to coding datasets enhances general reasoning and critique abilities, which transfer across a broad range of tasks. Hence, we believe that CRL serves as a strong complement to standard RL for LLM reasoning.
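The two mechanisms the abstract describes — a binary reward keyed to the critique's final True/False judgment, and replacing 20% of standard RL data with CRL data — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the label-extraction format (taking the last "True"/"False" token in the critique) and the function names are assumptions.

```python
import random
import re

def crl_reward(critique: str, ground_truth_label: bool) -> float:
    """CRL reward sketch: 1.0 if the critique's final judgment label
    c in {True, False} matches the ground-truth judgment c^*, else 0.0.
    Parsing the last True/False occurrence is an assumed convention."""
    # Negative lookahead ensures we match the *last* True/False in the text.
    match = re.search(r"\b(True|False)\b(?!.*\b(?:True|False)\b)", critique, re.S)
    if match is None:
        return 0.0  # no judgment label emitted -> no reward
    predicted = match.group(1) == "True"
    return 1.0 if predicted == ground_truth_label else 0.0

def mix_training_data(rl_examples, crl_examples, crl_fraction=0.2, seed=0):
    """Replace a fraction of standard RL examples with CRL examples
    (the paper substitutes 20%), then shuffle the combined pool."""
    rng = random.Random(seed)
    n_crl = int(len(rl_examples) * crl_fraction)
    mixed = rl_examples[: len(rl_examples) - n_crl] + crl_examples[:n_crl]
    rng.shuffle(mixed)
    return mixed
```

A critique ending in "... therefore the final judgment is True" would earn reward 1.0 against a ground-truth label of True and 0.0 against False; critiques with no parseable label receive no reward under this sketch.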