RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs

Abstract

Despite their unprecedented success, even the largest language models make mistakes. Similar to how humans learn and improve using feedback, previous work proposed providing language models with natural language feedback to guide them in repairing their outputs. Because human-generated critiques are expensive to obtain, researchers have devised learned critique generators in lieu of human critics while assuming one can train downstream models to utilize generated feedback. However, this approach does not apply to black-box or limited-access models such as ChatGPT, as they cannot be fine-tuned. Moreover, in the era of large general-purpose language agents, fine-tuning is neither computationally nor spatially efficient as it results in multiple copies of the network. In this work, we introduce RL4F (Reinforcement Learning for Feedback), a multi-agent collaborative framework where the critique generator is trained to maximize end-task performance of GPT-3, a fixed model more than 200 times its size. RL4F produces critiques that help GPT-3 revise its outputs. We study three datasets for action planning, summarization and alphabetization and show improvements (~5% on average) in multiple text similarity metrics over strong baselines across all three tasks.
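
The training signal described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `critique_reward`, `toy_critic`, and `toy_refiner` are hypothetical, the frozen model is replaced by a stub rather than a call to GPT-3, and `difflib.SequenceMatcher` stands in for whatever end-task metric is actually optimized. The point it shows is only that the critique generator is scored by the quality of the fixed model's revised output, not by the critique text itself.

```python
from difflib import SequenceMatcher
from typing import Callable

# Hypothetical type aliases for this sketch only.
Critic = Callable[[str, str], str]          # (task input, initial output) -> critique
Refiner = Callable[[str, str, str], str]    # (task input, initial output, critique) -> revision


def similarity(prediction: str, reference: str) -> float:
    """Order-sensitive token similarity in [0, 1]; a stand-in for a task metric."""
    return SequenceMatcher(None, prediction.split(), reference.split()).ratio()


def critique_reward(task_input: str, initial_output: str, reference: str,
                    critic: Critic, frozen_refiner: Refiner) -> float:
    """Score a critique by how good the frozen model's revision is."""
    critique = critic(task_input, initial_output)
    revised = frozen_refiner(task_input, initial_output, critique)
    # Only the critic would be updated from this reward; the refiner stays fixed.
    return similarity(revised, reference)


def toy_critic(task_input: str, initial_output: str) -> str:
    """Stub critic for an alphabetization example (assumption, not a trained model)."""
    return "The words are not sorted; sort them alphabetically."


def toy_refiner(task_input: str, initial_output: str, critique: str) -> str:
    """Stub for the fixed model: it revises only when the critique asks for sorting."""
    if "sort" in critique.lower():
        return " ".join(sorted(initial_output.split()))
    return initial_output


if __name__ == "__main__":
    reward = critique_reward("pear apple fig", "pear apple fig", "apple fig pear",
                             toy_critic, toy_refiner)
    print(f"reward for the helpful critique: {reward:.2f}")
```

In a real setup this scalar would feed a policy-gradient update of the critique generator, while the refiner is only queried and never fine-tuned, which is what makes the approach applicable to black-box models.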
