Towards Understanding Sycophancy in Language Models
October 20, 2023
Authors: Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez
cs.AI
Abstract
Reinforcement learning from human feedback (RLHF) is a popular technique for
training high-quality AI assistants. However, RLHF may also encourage model
responses that match user beliefs over truthful responses, a behavior known as
sycophancy. We investigate the prevalence of sycophancy in RLHF-trained models
and whether human preference judgements are responsible. We first demonstrate
that five state-of-the-art AI assistants consistently exhibit sycophantic
behavior across four varied free-form text-generation tasks. To understand if
human preferences drive this broadly observed behavior of RLHF models, we
analyze existing human preference data. We find that when a response matches a
user's views, it is more likely to be preferred. Moreover, both humans and
preference models (PMs) prefer convincingly-written sycophantic responses over
correct ones a non-negligible fraction of the time. Optimizing model outputs
against PMs also sometimes sacrifices truthfulness in favor of sycophancy.
Overall, our results indicate that sycophancy is a general behavior of RLHF
models, likely driven in part by human preference judgements favoring
sycophantic responses.