Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning

October 28, 2025
Authors: Zhiheng Xi, Jixuan Huang, Xin Guo, Boyang Hong, Dingwen Yang, Xiaoran Fan, Shuo Li, Zehui Chen, Junjie Ye, Siyu Yuan, Zhengyin Du, Xuesong Yao, Yufei Xu, Jiecao Chen, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang
cs.AI

Abstract

Training critiquing language models to assess and provide feedback on model outputs is a promising way to improve LLMs for complex reasoning tasks. However, existing approaches typically rely on stronger supervisors for annotating critique data. To address this, we propose Critique-RL, an online RL approach for developing critiquing language models without stronger supervision. Our approach operates on a two-player paradigm: the actor generates a response, the critic provides feedback, and the actor refines the response accordingly. We first reveal that relying solely on indirect reward signals from the actor's outputs for RL optimization often leads to unsatisfactory critics: while their helpfulness (i.e., providing constructive feedback) improves, the discriminability (i.e., determining whether a response is high-quality or not) remains poor, resulting in marginal performance gains. To overcome this, Critique-RL adopts a two-stage optimization strategy. In stage I, it reinforces the discriminability of the critic with direct rule-based reward signals; in stage II, it introduces indirect rewards based on actor refinement to improve the critic's helpfulness, while maintaining its discriminability via appropriate regularization. Extensive experiments across various tasks and models show that Critique-RL delivers substantial performance improvements. For example, it achieves a 9.02% gain on in-domain tasks and a 5.70% gain on out-of-domain tasks for Qwen2.5-7B, highlighting its potential.
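To make the two-stage reward design concrete, the sketch below shows how the two reward signals described above could be computed for a single actor-critic-actor round. It is a minimal, illustrative Python example, not the paper's implementation: the `Interaction` class, the function names, and the `reg_coef` regularization weight are all assumptions introduced here for clarity.

```python
# Illustrative sketch of the two-stage reward design (assumed, not the paper's code).
from dataclasses import dataclass


@dataclass
class Interaction:
    """One actor -> critic -> actor round for a single problem."""
    initial_correct: bool   # ground truth: was the actor's first response correct?
    critic_verdict: bool    # critic's judgment: "the response is correct"
    refined_correct: bool   # ground truth: is the actor's refined response correct?


def stage1_reward(x: Interaction) -> float:
    """Stage I: direct, rule-based reward targeting discriminability.

    The critic is rewarded when its verdict matches the ground-truth
    correctness of the actor's initial response.
    """
    return 1.0 if x.critic_verdict == x.initial_correct else 0.0


def stage2_reward(x: Interaction, reg_coef: float = 0.5) -> float:
    """Stage II: indirect reward targeting helpfulness, regularized so the
    discriminability learned in stage I is not lost.

    The indirect term pays off when the actor's refinement ends up correct;
    the regularizer reuses the stage-I rule-based signal (reg_coef is an
    assumed hyperparameter for this sketch).
    """
    helpfulness = 1.0 if x.refined_correct else 0.0
    discriminability = stage1_reward(x)
    return helpfulness + reg_coef * discriminability


if __name__ == "__main__":
    # Example: the critic correctly flags a wrong answer and the refinement fixes it.
    round_ = Interaction(initial_correct=False, critic_verdict=False, refined_correct=True)
    print(stage1_reward(round_))  # 1.0 -> verdict matches ground truth
    print(stage2_reward(round_))  # 1.5 -> successful refinement + regularizer
```

In this sketch, the stage-II reward couples the critic's return to the actor's refined output (the indirect signal) while the added stage-I term keeps the critic's verdicts anchored to ground-truth correctness, mirroring the regularization role described in the abstract.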