Text2Grad: Reinforcement Learning from Natural Language Feedback
May 28, 2025
Authors: Hanyang Wang, Lu Wang, Chaoyun Zhang, Tianjun Mao, Si Qin, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
cs.AI
Abstract
Traditional RLHF optimizes language models with coarse, scalar rewards that
mask the fine-grained reasons behind success or failure, leading to slow and
opaque learning. Recent work augments RL with textual critiques through
prompting or reflection, improving interpretability but leaving model
parameters untouched. We introduce Text2Grad, a reinforcement-learning paradigm
that turns free-form textual feedback into span-level gradients. Given human
(or programmatic) critiques, Text2Grad aligns each feedback phrase with the
relevant token spans, converts these alignments into differentiable reward
signals, and performs gradient updates that directly refine the offending
portions of the model's policy. This yields precise, feedback-conditioned
adjustments instead of global nudges. Text2Grad is realized through three
components: (1) a high-quality feedback-annotation pipeline that pairs
critiques with token spans; (2) a fine-grained reward model that predicts
span-level rewards over the answer while generating explanatory critiques; and (3) a
span-level policy optimizer that back-propagates natural-language gradients.
Across summarization, code generation, and question answering, Text2Grad
consistently surpasses scalar-reward RL and prompt-only baselines, providing
both higher task metrics and richer interpretability. Our results demonstrate
that natural-language feedback, when converted to gradients, is a powerful
signal for fine-grained policy optimization. The code for our method is
available at https://github.com/microsoft/Text2Grad.
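
The abstract describes the span-level policy optimizer only at a high level. The minimal sketch below illustrates one way span-aligned rewards could weight a REINFORCE-style, per-token policy-gradient loss; it is not the released Text2Grad implementation, and every name in it (SpanReward, spans_to_token_rewards, span_policy_gradient_loss, the toy logits and spans) is a hypothetical stand-in introduced only for illustration.

# Illustrative sketch only: shows span-level rewards replacing a single scalar
# reward in a policy-gradient update. Not the official Text2Grad code.

from dataclasses import dataclass
from typing import List

import torch
import torch.nn.functional as F


@dataclass
class SpanReward:
    """A critique phrase aligned to a token span [start, end) with a scalar score."""
    start: int
    end: int
    reward: float


def spans_to_token_rewards(spans: List[SpanReward], seq_len: int) -> torch.Tensor:
    """Broadcast span-level rewards onto individual token positions.

    Tokens covered by no span receive 0, so only the critiqued portions
    of the response contribute to the update.
    """
    token_rewards = torch.zeros(seq_len)
    for span in spans:
        token_rewards[span.start:span.end] = span.reward
    return token_rewards


def span_policy_gradient_loss(
    logits: torch.Tensor,        # (seq_len, vocab_size) policy logits for the sampled response
    sampled_ids: torch.Tensor,   # (seq_len,) token ids actually generated
    token_rewards: torch.Tensor, # (seq_len,) per-token rewards from spans_to_token_rewards
) -> torch.Tensor:
    """REINFORCE-style loss in which each token is weighted by its span-level reward."""
    log_probs = F.log_softmax(logits, dim=-1)
    chosen_log_probs = log_probs.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)
    # Negative because optimizers minimize: positively rewarded spans are reinforced,
    # negatively rewarded spans are suppressed, and uncritiqued tokens get zero gradient.
    return -(token_rewards * chosen_log_probs).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    seq_len, vocab_size = 12, 50

    # Toy "policy": logits with gradients, standing in for a language-model head output.
    logits = torch.randn(seq_len, vocab_size, requires_grad=True)
    sampled_ids = torch.randint(0, vocab_size, (seq_len,))

    # Hypothetical reward-model output: two critique phrases aligned to token spans.
    spans = [
        SpanReward(start=0, end=4, reward=+1.0),   # e.g. "clear opening sentence"
        SpanReward(start=7, end=10, reward=-0.8),  # e.g. "factually wrong claim"
    ]

    token_rewards = spans_to_token_rewards(spans, seq_len)
    loss = span_policy_gradient_loss(logits, sampled_ids, token_rewards)
    loss.backward()

    # Gradients concentrate on the critiqued spans; uncritiqued tokens stay at zero.
    per_token_grad_norm = logits.grad.norm(dim=-1)
    print(per_token_grad_norm)

The design point the sketch highlights is that tokens outside any critiqued span receive zero reward and therefore zero gradient, so the update stays local to the critiqued portions of the response rather than nudging the whole sequence, which is the behavior the abstract contrasts with scalar-reward RLHF.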