

Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning

October 28, 2025
作者: Zhiheng Xi, Jixuan Huang, Xin Guo, Boyang Hong, Dingwen Yang, Xiaoran Fan, Shuo Li, Zehui Chen, Junjie Ye, Siyu Yuan, Zhengyin Du, Xuesong Yao, Yufei Xu, Jiecao Chen, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang
cs.AI

Abstract

Training critiquing language models to assess and provide feedback on model outputs is a promising way to improve LLMs for complex reasoning tasks. However, existing approaches typically rely on stronger supervisors for annotating critique data. To address this, we propose Critique-RL, an online RL approach for developing critiquing language models without stronger supervision. Our approach operates on a two-player paradigm: the actor generates a response, the critic provides feedback, and the actor refines the response accordingly. We first reveal that relying solely on indirect reward signals from the actor's outputs for RL optimization often leads to unsatisfactory critics: while their helpfulness (i.e., providing constructive feedback) improves, the discriminability (i.e., determining whether a response is high-quality or not) remains poor, resulting in marginal performance gains. To overcome this, Critique-RL adopts a two-stage optimization strategy. In stage I, it reinforces the discriminability of the critic with direct rule-based reward signals; in stage II, it introduces indirect rewards based on actor refinement to improve the critic's helpfulness, while maintaining its discriminability via appropriate regularization. Extensive experiments across various tasks and models show that Critique-RL delivers substantial performance improvements. For example, it achieves a 9.02% gain on in-domain tasks and a 5.70% gain on out-of-domain tasks for Qwen2.5-7B, highlighting its potential.
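To make the two-stage reward design concrete, below is a minimal Python sketch of how the critic's reward could be shaped in each stage. The `Turn` structure, the binary helpfulness signal, and the `reg_weight` coefficient are illustrative assumptions rather than the paper's exact formulation; they only mirror the abstract's description of a direct rule-based reward for discriminability (Stage I) and an indirect, regularized refinement-based reward for helpfulness (Stage II).

```python
"""Illustrative sketch of Critique-RL's two-stage critic reward.

All names, binary signals, and the regularization weight are assumptions
made for this example; the paper's actual reward functions may differ.
"""

from dataclasses import dataclass


@dataclass
class Turn:
    """One actor -> critic -> actor interaction on a single problem."""
    initial_correct: bool   # is the actor's first response correct?
    critic_verdict: bool    # critic's judgment: "the response is correct"
    refined_correct: bool   # is the actor's refined response correct?


def stage1_reward(turn: Turn) -> float:
    """Stage I: direct rule-based reward for discriminability.

    The critic is rewarded when its verdict matches the ground-truth
    correctness of the actor's initial response.
    """
    return 1.0 if turn.critic_verdict == turn.initial_correct else 0.0


def stage2_reward(turn: Turn, reg_weight: float = 0.5) -> float:
    """Stage II: indirect reward for helpfulness with regularization.

    The indirect term pays off when the actor's refinement turns out
    correct (i.e., the feedback was actually useful); the regularizer
    re-applies the Stage I rule-based signal at a smaller weight so the
    discriminability learned in Stage I is not lost.
    """
    helpfulness = 1.0 if turn.refined_correct else 0.0
    discriminability = stage1_reward(turn)
    return helpfulness + reg_weight * discriminability


if __name__ == "__main__":
    # A critic that correctly flags a wrong answer and whose feedback
    # leads the actor to a correct refinement earns full credit.
    turn = Turn(initial_correct=False, critic_verdict=False, refined_correct=True)
    print("stage I reward :", stage1_reward(turn))   # 1.0
    print("stage II reward:", stage2_reward(turn))   # 1.5
```

In this sketch, setting `reg_weight` to zero would reproduce the failure mode the abstract describes: optimizing only on the indirect refinement signal lets discriminability degrade even as helpfulness improves.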