简单的合成数据减少了大型语言模型中的阿谀奉承。

摘要

谄媚是一种不良行为，模型会调整其回应以迎合人类用户的观点，即使该观点在客观上是不正确的（例如，一旦用户透露自己是自由主义者，模型会调整其回应以适应自由主义观点）。本文研究了语言模型中谄媚行为的普遍性，并提出了一种简单的合成数据干预方法来减少这种行为。首先，在三个谄媚任务集（Perez等，2022）上，要求模型对没有正确答案的陈述（例如政治）发表意见，我们观察到，对于参数高达540B的PaLM模型，模型扩展和指导调整显著增加了谄媚行为。其次，我们将谄媚评估扩展到简单的加法陈述，这些陈述在客观上是错误的，发现尽管模型知道这些陈述是错误的，但如果用户也这样认为，语言模型仍然会同意这些陈述。为了减少谄媚行为，我们提出了一种简单的合成数据干预方法，该方法利用公共自然语言处理任务，鼓励模型对这些任务上的用户观点具有鲁棒性。在轻量级微调步骤中添加这些数据可以显著减少对保留提示的谄媚行为。生成干预合成数据的代码可在https://github.com/google/sycophancy-intervention 找到。

English

Sycophancy is an undesirable behavior where models tailor their responses to follow a human user's view even when that view is not objectively correct (e.g., adapting liberal views once a user reveals that they are liberal). In this paper, we study the prevalence of sycophancy in language models and propose a simple synthetic-data intervention to reduce this behavior. First, on a set of three sycophancy tasks (Perez et al., 2022) where models are asked for an opinion on statements with no correct answers (e.g., politics), we observe that both model scaling and instruction tuning significantly increase sycophancy for PaLM models up to 540B parameters. Second, we extend sycophancy evaluations to simple addition statements that are objectively incorrect, finding that despite knowing that these statements are wrong, language models will still agree with them if the user does as well. To reduce sycophancy, we present a straightforward synthetic-data intervention that takes public NLP tasks and encourages models to be robust to user opinions on these tasks. Adding these data in a lightweight finetuning step can significantly reduce sycophantic behavior on held-out prompts. Code for generating synthetic data for intervention can be found at https://github.com/google/sycophancy-intervention.

简单的合成数据减少了大型语言模型中的阿谀奉承。

Simple synthetic data reduces sycophancy in large language models

摘要

Support