Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning
September 26, 2025
Authors: Chi Ruan, Dongfu Jiang, Yubo Wang, Wenhu Chen
cs.AI
Abstract
Reinforcement Learning (RL) has emerged as a popular training paradigm,
particularly when paired with reasoning models. While effective, it primarily
focuses on generating responses and lacks mechanisms to explicitly foster
critique or reflection. Several recent studies, such as Critique-Fine-Tuning (CFT)
and Critique-Guided-Distillation (CGD), have shown the benefits of explicitly
teaching LLMs how to critique. Motivated by these, we propose Critique
Reinforcement Learning (CRL), in which the model is tasked with generating a
critique for a given (question, solution) pair. The reward is determined solely
by whether the final judgment label c ∈ {True, False} of the generated critique
agrees with the ground-truth judgment c^* (see the illustrative sketches
following this abstract). Building on this, we introduce Critique-Coder, which
is trained on a hybrid of RL and CRL by substituting 20% of the standard RL
data with CRL data. We fine-tune multiple models (Critique-Coder) and evaluate them
on different benchmarks to demonstrate their advantages over RL-only models.
Critique-Coder consistently outperforms RL-only baselines on all the evaluated
benchmarks. Notably, our Critique-Coder-8B reaches over 60% on LiveCodeBench
(v5), outperforming other reasoning models such as
DeepCoder-14B and GPT-o1. Beyond code generation, Critique-Coder also
demonstrates enhanced general reasoning abilities, as evidenced by its better
performance on logical reasoning tasks from the BBEH dataset. This indicates
that applying CRL to coding datasets enhances general reasoning and critique
abilities that transfer across a broad range of tasks. Hence, we believe that
CRL serves as a valuable complement to standard RL for LLM reasoning.
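
As a concrete illustration, the following is a minimal sketch (in Python) of the CRL reward described above. It assumes the critique ends with an explicit True/False verdict; parse_verdict is a hypothetical helper, since the abstract does not specify how the final judgment label is extracted from the critique text.

    def parse_verdict(critique_text: str):
        """Hypothetical parser: read a final True/False label from the end
        of the critique. Returns None if no verdict can be found."""
        words = critique_text.lower().split()
        if not words:
            return None
        last = words[-1].strip(".,!:;}")
        if last in ("true", "false"):
            return last == "true"
        return None

    def crl_reward(critique_text: str, ground_truth_correct: bool) -> float:
        """CRL reward as stated in the abstract: 1.0 iff the critique's final
        judgment c matches the ground-truth judgment c^*, else 0.0."""
        verdict = parse_verdict(critique_text)
        if verdict is None:  # malformed critique with no parsable verdict
            return 0.0
        return 1.0 if verdict == ground_truth_correct else 0.0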
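
Likewise, a minimal sketch of the hybrid data recipe, under the assumption that the 20% substitution happens once at the dataset level (the paper may instead mix per batch); mix_training_data and its arguments are illustrative names, not from the paper.

    import random

    def mix_training_data(rl_examples, crl_examples, crl_fraction=0.2, seed=0):
        """Swap a crl_fraction share of the standard RL data for CRL
        (critique) examples, keeping the total training-set size fixed."""
        rng = random.Random(seed)
        n_total = len(rl_examples)
        n_crl = int(crl_fraction * n_total)  # assumes len(crl_examples) >= n_crl
        mixed = rng.sample(rl_examples, n_total - n_crl) + rng.sample(crl_examples, n_crl)
        rng.shuffle(mixed)
        return mixed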