Text2Grad: Reinforcement Learning from Natural Language Feedback

May 28, 2025
作者: Hanyang Wang, Lu Wang, Chaoyun Zhang, Tianjun Mao, Si Qin, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
cs.AI

Abstract

Traditional RLHF optimizes language models with coarse, scalar rewards that mask the fine-grained reasons behind success or failure, leading to slow and opaque learning. Recent work augments RL with textual critiques through prompting or reflection, improving interpretability but leaving model parameters untouched. We introduce Text2Grad, a reinforcement-learning paradigm that turns free-form textual feedback into span-level gradients. Given human (or programmatic) critiques, Text2Grad aligns each feedback phrase with the relevant token spans, converts these alignments into differentiable reward signals, and performs gradient updates that directly refine the offending portions of the model's policy. This yields precise, feedback-conditioned adjustments instead of global nudges. Text2Grad is realized through three components: (1) a high-quality feedback-annotation pipeline that pairs critiques with token spans; (2) a fine-grained reward model that predicts span-level rewards over the answer while generating explanatory critiques; and (3) a span-level policy optimizer that back-propagates natural-language gradients. Across summarization, code generation, and question answering, Text2Grad consistently surpasses scalar-reward RL and prompt-only baselines, providing both higher task metrics and richer interpretability. Our results demonstrate that natural-language feedback, when converted to gradients, is a powerful signal for fine-grained policy optimization. The code for our method is available at https://github.com/microsoft/Text2Grad.
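
The span-level optimizer in component (3) can be pictured as a token-weighted policy-gradient update. Below is a minimal sketch of that idea, assuming the reward model has already aligned each critique phrase to token positions and emitted per-token rewards; the function and tensor names are illustrative and not taken from the Text2Grad repository.

```python
# Minimal sketch (not the authors' implementation) of a span-level
# policy-gradient loss: each response token's log-probability is weighted
# by the reward of the feedback span it belongs to.
import torch
import torch.nn.functional as F

def span_level_pg_loss(logits, response_ids, span_rewards, response_mask):
    """REINFORCE-style loss with per-token (span-derived) rewards.

    logits:        (batch, seq_len, vocab) policy logits over the response
    response_ids:  (batch, seq_len) generated token ids
    span_rewards:  (batch, seq_len) per-token reward from aligned critiques
    response_mask: (batch, seq_len) 1 for response tokens, 0 for prompt/padding
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Log-probability of each generated token under the current policy.
    token_logp = log_probs.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
    # Positive-reward spans push their tokens up, negative-reward spans push them down.
    weighted = -(span_rewards * token_logp) * response_mask
    return weighted.sum() / response_mask.sum().clamp(min=1)
```

In this view, a critique such as "the second sentence is factually wrong" would become negative rewards on that sentence's tokens, so the gradient update lowers their probability while leaving the rest of the response largely untouched, which is the "precise, feedback-conditioned adjustment" the abstract contrasts with global nudges.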
