

RLVF: Learning from Verbal Feedback without Overgeneralization

February 16, 2024
Authors: Moritz Stephan, Alexander Khazatsky, Eric Mitchell, Annie S Chen, Sheryl Hsu, Archit Sharma, Chelsea Finn
cs.AI

Abstract

The diversity of contexts in which large language models (LLMs) are deployed requires the ability to modify or customize default model behaviors to incorporate nuanced requirements and preferences. A convenient interface to specify such model adjustments is high-level verbal feedback, such as "Don't use emojis when drafting emails to my boss." However, while writing high-level feedback is far simpler than collecting annotations for reinforcement learning from human feedback (RLHF), we find that simply prompting a model with such feedback leads to overgeneralization of the feedback to contexts where it is not relevant. We study the problem of incorporating verbal feedback without such overgeneralization, and propose a new method, Contextualized Critiques with Constrained Preference Optimization (C3PO). C3PO uses a piece of high-level feedback to generate a small synthetic preference dataset specifying how the feedback should (and should not) be applied. It then fine-tunes the model in accordance with the synthetic preference data while minimizing the divergence from the original model for prompts where the feedback does not apply. Our experimental results indicate that our approach effectively applies verbal feedback to relevant scenarios while preserving existing behaviors for other contexts. For both human- and GPT-4-generated high-level feedback, C3PO adheres to the given feedback comparably to in-context baselines while reducing overgeneralization by 30%.
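The abstract describes a two-part training objective: a preference-optimization loss on synthetic examples where the feedback applies, plus a term that keeps the fine-tuned model close to the original model on prompts where it does not. The sketch below illustrates one plausible instantiation of such an objective in PyTorch, assuming a DPO-style preference loss on in-scope data and an SFT-style anchor on the original model's responses for out-of-scope prompts. The function and parameter names (`dpo_loss`, `c3po_loss`, `beta`, `lam`) are illustrative and are not taken from the paper or its released code.

```python
# Minimal sketch of a C3PO-style objective on precomputed per-response
# log-probabilities (summed over tokens). All names are illustrative.
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO-style preference loss on prompts where the feedback applies."""
    logits = beta * ((policy_chosen_lp - ref_chosen_lp)
                     - (policy_rejected_lp - ref_rejected_lp))
    return -F.logsigmoid(logits).mean()


def c3po_loss(in_scope_logps, policy_lp_on_ref_responses, beta=0.1, lam=1.0):
    """Combine the in-scope preference loss with an out-of-scope anchor.

    The anchor maximizes the fine-tuned policy's likelihood of the original
    model's responses on out-of-scope prompts, discouraging the feedback
    from bleeding into contexts where it is irrelevant.
    """
    pref = dpo_loss(*in_scope_logps, beta=beta)
    anchor = -policy_lp_on_ref_responses.mean()
    return pref + lam * anchor


# Toy usage with random log-probabilities standing in for model outputs.
torch.manual_seed(0)
in_scope_logps = tuple(torch.randn(8) for _ in range(4))        # chosen/rejected under policy and reference
policy_lp_on_ref_responses = torch.randn(8)                     # out-of-scope prompts
print(c3po_loss(in_scope_logps, policy_lp_on_ref_responses).item())
```

In practice the two terms would be computed from separate batches (in-scope synthetic preference pairs vs. out-of-scope prompts with the original model's completions), and `lam` trades off feedback adherence against behavior preservation.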