RLVF: Learning from Verbal Feedback without Overgeneralization
February 16, 2024
Authors: Moritz Stephan, Alexander Khazatsky, Eric Mitchell, Annie S Chen, Sheryl Hsu, Archit Sharma, Chelsea Finn
cs.AI
Abstract
The diversity of contexts in which large language models (LLMs) are deployed
requires the ability to modify or customize default model behaviors to
incorporate nuanced requirements and preferences. A convenient interface to
specify such model adjustments is high-level verbal feedback, such as "Don't
use emojis when drafting emails to my boss." However, while writing high-level
feedback is far simpler than collecting annotations for reinforcement learning
from human feedback (RLHF), we find that simply prompting a model with such
feedback leads to overgeneralization of the feedback to contexts where it is
not relevant. We study the problem of incorporating verbal feedback without
such overgeneralization, inspiring a new method, Contextualized Critiques with
Constrained Preference Optimization (C3PO). C3PO uses a piece of high-level
feedback to generate a small synthetic preference dataset specifying how the
feedback should (and should not) be applied. It then fine-tunes the model in
accordance with the synthetic preference data while minimizing the divergence
from the original model for prompts where the feedback does not apply. Our
experimental results indicate that our approach effectively applies verbal
feedback to relevant scenarios while preserving existing behaviors for other
contexts. For both human- and GPT-4-generated high-level feedback, C3PO
effectively adheres to the given feedback comparably to in-context baselines
while reducing overgeneralization by 30%.
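
The abstract does not state C3PO's training objective explicitly, but its description suggests a loss with two parts: a preference-optimization term on prompts where the feedback applies, and a divergence penalty toward the original model on prompts where it does not. Below is a minimal sketch of one plausible form, assuming a DPO-style preference term; the symbols \pi_\theta (fine-tuned model), \pi_{\mathrm{ref}} (original model), \mathcal{D}_{\mathrm{in}}, \mathcal{D}_{\mathrm{out}}, \beta, and \lambda are illustrative and not taken from the paper.

\mathcal{L}(\theta) =
  \mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}_{\mathrm{in}}}
    \left[ -\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
  \;+\;
  \lambda \, \mathbb{E}_{x \sim \mathcal{D}_{\mathrm{out}}}
    \left[ \mathrm{KL}\!\left( \pi_{\mathrm{ref}}(\cdot \mid x) \,\Vert\, \pi_\theta(\cdot \mid x) \right) \right]

Here (x, y_w, y_l) are synthetic preference triples generated from the high-level feedback for in-scope prompts, and the second term keeps the fine-tuned model close to the original model on prompts where the feedback should not apply, which is what limits overgeneralization in this sketch.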