RLVF: Learning from Verbal Feedback without Overgeneralization
February 16, 2024
Authors: Moritz Stephan, Alexander Khazatsky, Eric Mitchell, Annie S. Chen, Sheryl Hsu, Archit Sharma, Chelsea Finn
cs.AI
Abstract
The diversity of contexts in which large language models (LLMs) are deployed
requires the ability to modify or customize default model behaviors to
incorporate nuanced requirements and preferences. A convenient interface to
specify such model adjustments is high-level verbal feedback, such as "Don't
use emojis when drafting emails to my boss." However, while writing high-level
feedback is far simpler than collecting annotations for reinforcement learning
from human feedback (RLHF), we find that simply prompting a model with such
feedback leads to overgeneralization of the feedback to contexts where it is
not relevant. We study the problem of incorporating verbal feedback without
such overgeneralization, inspiring a new method, Contextualized Critiques with
Constrained Preference Optimization (C3PO). C3PO uses a piece of high-level
feedback to generate a small synthetic preference dataset specifying how the
feedback should (and should not) be applied. It then fine-tunes the model in
accordance with the synthetic preference data while minimizing the divergence
from the original model for prompts where the feedback does not apply. Our
experimental results indicate that our approach effectively applies verbal
feedback to relevant scenarios while preserving existing behaviors for other
contexts. For both human- and GPT-4-generated high-level feedback, C3PO
effectively adheres to the given feedback comparably to in-context baselines
while reducing overgeneralization by 30%.
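
To make the constrained fine-tuning step concrete, the following is a minimal sketch of an objective of the kind the abstract describes. The choice of a DPO-style preference term, the cross-entropy penalty toward the original model's completions on out-of-scope prompts, the datasets D_in-scope and D_out-of-scope, and the weight λ are illustrative assumptions for this sketch, not the paper's exact formulation:

$$
\mathcal{L}(\theta) \;=\; \mathbb{E}_{(x,\,y_w,\,y_l)\sim \mathcal{D}_{\text{in-scope}}}\big[\,\mathcal{L}_{\text{DPO}}(\theta;\,x,\,y_w,\,y_l)\,\big]
\;+\; \lambda\,\mathbb{E}_{x\sim \mathcal{D}_{\text{out-of-scope}}}\big[\,-\log \pi_\theta\big(y_{\text{ref}}(x)\mid x\big)\,\big]
$$

Here $(x, y_w, y_l)$ are synthetic prompts with preferred and dispreferred responses generated from the verbal feedback, $y_{\text{ref}}(x)$ denotes the original (pre-fine-tuning) model's response to a prompt where the feedback does not apply, and $\lambda$ trades off feedback adherence against behavior preservation. Under this reading, the second term keeps the fine-tuned policy close to the original model precisely on prompts the feedback should not affect, which is how overgeneralization is discouraged.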