RLVF: Learning from Verbal Feedback without Overgeneralization
February 16, 2024
Authors: Moritz Stephan, Alexander Khazatsky, Eric Mitchell, Annie S. Chen, Sheryl Hsu, Archit Sharma, Chelsea Finn
cs.AI
Abstract
The diversity of contexts in which large language models (LLMs) are deployed
requires the ability to modify or customize default model behaviors to
incorporate nuanced requirements and preferences. A convenient interface to
specify such model adjustments is high-level verbal feedback, such as "Don't
use emojis when drafting emails to my boss." However, while writing high-level
feedback is far simpler than collecting annotations for reinforcement learning
from human feedback (RLHF), we find that simply prompting a model with such
feedback leads to overgeneralization of the feedback to contexts where it is
not relevant. We study the problem of incorporating verbal feedback without
such overgeneralization, inspiring a new method, Contextualized Critiques with
Constrained Preference Optimization (C3PO). C3PO uses a piece of high-level
feedback to generate a small synthetic preference dataset specifying how the
feedback should (and should not) be applied. It then fine-tunes the model in
accordance with the synthetic preference data while minimizing the divergence
from the original model for prompts where the feedback does not apply. Our
experimental results indicate that our approach effectively applies verbal
feedback to relevant scenarios while preserving existing behaviors for other
contexts. For both human- and GPT-4-generated high-level feedback, C3PO
effectively adheres to the given feedback comparably to in-context baselines
while reducing overgeneralization by 30%.
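
To make the constrained fine-tuning step concrete, the following is a minimal sketch of an objective of the kind the abstract describes. The choice of a DPO-style preference term, the cross-entropy penalty toward the original model's completions on out-of-scope prompts, the datasets D_in-scope and D_out-of-scope, and the weight λ are illustrative assumptions for this sketch, not the paper's exact formulation:

$$
\mathcal{L}(\theta) \;=\; \mathbb{E}_{(x,\,y_w,\,y_l)\sim \mathcal{D}_{\text{in-scope}}}\big[\,\mathcal{L}_{\text{DPO}}(\theta;\,x,\,y_w,\,y_l)\,\big]
\;+\; \lambda\,\mathbb{E}_{x\sim \mathcal{D}_{\text{out-of-scope}}}\big[\,-\log \pi_\theta\big(y_{\text{ref}}(x)\mid x\big)\,\big]
$$

Here $(x, y_w, y_l)$ are synthetic prompts with preferred and dispreferred responses generated from the verbal feedback, $y_{\text{ref}}(x)$ denotes the original (pre-fine-tuning) model's response to a prompt where the feedback does not apply, and $\lambda$ trades off feedback adherence against behavior preservation. Under this reading, the second term keeps the fine-tuned policy close to the original model precisely on prompts the feedback should not affect, which is how overgeneralization is discouraged.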