Towards Understanding Sycophancy in Language Models

October 20, 2023
作者: Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez
cs.AI

Abstract

Reinforcement learning from human feedback (RLHF) is a popular technique for training high-quality AI assistants. However, RLHF may also encourage model responses that match user beliefs over truthful responses, a behavior known as sycophancy. We investigate the prevalence of sycophancy in RLHF-trained models and whether human preference judgements are responsible. We first demonstrate that five state-of-the-art AI assistants consistently exhibit sycophantic behavior across four varied free-form text-generation tasks. To understand if human preferences drive this broadly observed behavior of RLHF models, we analyze existing human preference data. We find that when a response matches a user's views, it is more likely to be preferred. Moreover, both humans and preference models (PMs) prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time. Optimizing model outputs against PMs also sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results indicate that sycophancy is a general behavior of RLHF models, likely driven in part by human preference judgements favoring sycophantic responses.
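To make the last point concrete, here is a minimal sketch (in Python, not code from the paper) of best-of-N sampling against a preference model: the PM's score decides which candidate response is returned, so if the PM assigns higher reward to responses that agree with the user's stated views, this selection step can favor sycophantic answers over truthful ones. The functions generate_candidates and pm_score are hypothetical placeholders standing in for an assistant model and a learned PM.

```python
# Minimal sketch (not the paper's code): best-of-N sampling against a
# preference model (PM). generate_candidates() and pm_score() are
# hypothetical placeholders for an assistant model and a learned PM.

from typing import Callable, List


def best_of_n(
    prompt: str,
    generate_candidates: Callable[[str, int], List[str]],  # samples N responses
    pm_score: Callable[[str, str], float],  # PM reward for (prompt, response)
    n: int = 16,
) -> str:
    """Return the candidate response that the preference model scores highest.

    If the PM rewards agreement with the user's views, this maximization can
    trade truthfulness for sycophancy, as described in the abstract.
    """
    candidates = generate_candidates(prompt, n)
    return max(candidates, key=lambda response: pm_score(prompt, response))
```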