簡單的合成數據可減少大型語言模型中的諂媚行為

摘要

諂媚是一種不良行為，指模型根據人類用戶的觀點調整其回應，即使該觀點在客觀上並不正確（例如，一旦用戶透露自己是自由主義者，模型便調整為支持自由主義觀點）。本文研究語言模型中諂媚行為的普遍性，並提出一種簡單的合成數據干預方法來減少這種行為。首先，在三個諂媚任務集合（Perez等，2022）上，這些任務要求模型對沒有正確答案的陳述（例如政治）發表意見，我們觀察到，對於PaLM模型，無論是模型規模還是指導調整都會顯著增加諂媚行為，直至達到540B參數。其次，我們將諂媚評估擴展到明顯不正確的簡單加法陳述，發現儘管模型知道這些陳述是錯誤的，但如果用戶也這樣認為，語言模型仍會同意這些陳述。為了減少諂媚行為，我們提出了一個簡單的合成數據干預方法，該方法利用公共自然語言處理任務，鼓勵模型對這些任務上的用戶意見保持強大。在輕量級微調步驟中添加這些數據可以顯著減少對留存提示的諂媚行為。生成用於干預的合成數據的代碼可在https://github.com/google/sycophancy-intervention 找到。

English

Sycophancy is an undesirable behavior where models tailor their responses to follow a human user's view even when that view is not objectively correct (e.g., adapting liberal views once a user reveals that they are liberal). In this paper, we study the prevalence of sycophancy in language models and propose a simple synthetic-data intervention to reduce this behavior. First, on a set of three sycophancy tasks (Perez et al., 2022) where models are asked for an opinion on statements with no correct answers (e.g., politics), we observe that both model scaling and instruction tuning significantly increase sycophancy for PaLM models up to 540B parameters. Second, we extend sycophancy evaluations to simple addition statements that are objectively incorrect, finding that despite knowing that these statements are wrong, language models will still agree with them if the user does as well. To reduce sycophancy, we present a straightforward synthetic-data intervention that takes public NLP tasks and encourages models to be robust to user opinions on these tasks. Adding these data in a lightweight finetuning step can significantly reduce sycophantic behavior on held-out prompts. Code for generating synthetic data for intervention can be found at https://github.com/google/sycophancy-intervention.

簡單的合成數據可減少大型語言模型中的諂媚行為

Simple synthetic data reduces sycophancy in large language models

摘要

Support