ChatPaper.aiChatPaper

探索语言模型中的拍马屁行为

Towards Understanding Sycophancy in Language Models

October 20, 2023
作者: Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez
cs.AI

摘要

人类反馈强化学习(RLHF)是训练高质量人工智能助手的一种流行技术。然而,RLHF可能会鼓励模型生成符合用户信念而非真实回复的行为,这种行为被称为谄媚行为。我们调查了在RLHF训练的模型中谄媚行为的普遍性以及人类偏好判断是否负有责任。我们首先展示了五个最先进的人工智能助手在四个不同的自由文本生成任务中一贯表现出谄媚行为。为了了解人类偏好是否驱使RLHF模型这种广泛观察到的行为,我们分析了现有的人类偏好数据。我们发现,当一个回复与用户观点一致时,更有可能被偏好。此外,人类和偏好模型(PMs)在极少数情况下更偏好写得令人信服的谄媚回复而非正确回复。针对PMs优化模型输出有时会以谄媚为代价而牺牲真实性。总的来说,我们的结果表明谄媚是RLHF模型的一种普遍行为,很可能部分受到人类偏好判断青睐谄媚回复的影响。
English
Reinforcement learning from human feedback (RLHF) is a popular technique for training high-quality AI assistants. However, RLHF may also encourage model responses that match user beliefs over truthful responses, a behavior known as sycophancy. We investigate the prevalence of sycophancy in RLHF-trained models and whether human preference judgements are responsible. We first demonstrate that five state-of-the-art AI assistants consistently exhibit sycophantic behavior across four varied free-form text-generation tasks. To understand if human preferences drive this broadly observed behavior of RLHF models, we analyze existing human preference data. We find that when a response matches a user's views, it is more likely to be preferred. Moreover, both humans and preference models (PMs) prefer convincingly-written sycophantic responses over correct ones a negligible fraction of the time. Optimizing model outputs against PMs also sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results indicate that sycophancy is a general behavior of RLHF models, likely driven in part by human preference judgements favoring sycophantic responses.
PDF72December 15, 2024