Language Models Can Learn from Verbal Feedback Without Scalar Rewards
September 26, 2025
Authors: Renjie Luo, Zichen Liu, Xiangyan Liu, Chao Du, Min Lin, Wenhu Chen, Wei Lu, Tianyu Pang
cs.AI
Abstract
LLMs are often trained with RL from human or AI feedback, yet such methods
typically compress nuanced feedback into scalar rewards, discarding much of
its richness and inducing scale imbalance. We propose treating verbal
feedback as a conditioning signal. Inspired by language priors in text-to-image
generation, which enable novel outputs from unseen prompts, we introduce the
feedback-conditional policy (FCP). FCP learns directly from response-feedback
pairs, approximating the feedback-conditional posterior through maximum
likelihood training on offline data. We further develop an online bootstrapping
stage where the policy generates under positive conditions and receives fresh
feedback to refine itself. This reframes feedback-driven learning as
conditional generation rather than reward optimization, offering a more
expressive way for LLMs to directly learn from verbal feedback. Our code is
available at https://github.com/sail-sg/feedback-conditional-policy.
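
As a reading aid, in our own notation rather than equations quoted from the paper: writing x for the prompt, y for the response, f for the verbal feedback, \pi_{\mathrm{ref}} for the policy that produced the offline data, and \mathcal{D} for the offline dataset, the feedback-conditional posterior that FCP approximates can be sketched via Bayes' rule, and the offline stage amounts to maximum-likelihood training of a conditional policy \pi_\theta(y \mid x, f):

\[
\pi^{\ast}(y \mid x, f) \;=\; \frac{\pi_{\mathrm{ref}}(y \mid x)\, p(f \mid x, y)}{\sum_{y'} \pi_{\mathrm{ref}}(y' \mid x)\, p(f \mid x, y')},
\qquad
\max_{\theta}\; \mathbb{E}_{(x,\, y,\, f) \sim \mathcal{D}}\!\left[\log \pi_{\theta}(y \mid x, f)\right].
\]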
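
The two stages described in the abstract can be illustrated with a minimal sketch; this is our own illustration, not the authors' released code (see the repository linked above for the actual implementation). The conditioning format, the placeholder base model, and the helpers POSITIVE_FEEDBACK and get_feedback are assumptions made purely for illustration.

# Sketch of (1) offline maximum-likelihood training on (prompt, feedback, response)
# triples, with the verbal feedback prepended as a conditioning prefix, and
# (2) online bootstrapping: generate under a positive condition, collect fresh
# feedback, and continue training on the new pair.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B"  # placeholder base model, not from the paper
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
policy = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-5)

POSITIVE_FEEDBACK = "The response is correct, clear, and helpful."  # hypothetical positive condition

def fcp_loss(prompt: str, feedback: str, response: str) -> torch.Tensor:
    """Negative log-likelihood of the response, conditioned on (prompt, feedback)."""
    prefix = f"Feedback: {feedback}\nPrompt: {prompt}\nResponse: "  # assumed conditioning format
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    response_ids = tokenizer(response, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, response_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prefix_ids.shape[1]] = -100  # only response tokens contribute to the loss
    return policy(input_ids=input_ids, labels=labels).loss

# Stage 1: offline MLE on logged response-feedback pairs.
offline_data = [("What is 2 + 2?", "Correct and concise.", "4")]  # toy example
for prompt, feedback, response in offline_data:
    loss = fcp_loss(prompt, feedback, response)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Stage 2: online bootstrapping -- generate under the positive condition,
# obtain fresh feedback, and train on the new (prompt, feedback, response) triple.
def get_feedback(prompt: str, response: str) -> str:
    """Hypothetical feedback source; stands in for a human or AI annotator."""
    return "Correct and concise."

for prompt in ["What is 3 + 5?"]:
    prefix = f"Feedback: {POSITIVE_FEEDBACK}\nPrompt: {prompt}\nResponse: "
    inputs = tokenizer(prefix, return_tensors="pt")
    out = policy.generate(**inputs, max_new_tokens=64, do_sample=True)
    response = tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
    feedback = get_feedback(prompt, response)
    loss = fcp_loss(prompt, feedback, response)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()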