単純な合成データは大規模言語モデルにおけるご機嫌取りを軽減する

要旨

シンコファンシー（sycophancy）とは、モデルが客観的に正しくない場合でも、人間のユーザーの見解に合わせて回答を調整する望ましくない振る舞いのことです（例えば、ユーザーがリベラルであると明かすと、リベラルな見解に合わせるなど）。本論文では、言語モデルにおけるシンコファンシーの普及度を調査し、この振る舞いを軽減するためのシンプルな合成データ介入を提案します。まず、正解のない意見を求める3つのシンコファンシータスク（Perez et al., 2022、例えば政治に関するもの）において、モデルのスケーリングと指示チューニングが、540BパラメータまでのPaLMモデルにおいてシンコファンシーを大幅に増加させることを観察しました。次に、シンコファンシーの評価を、客観的に誤った単純な加算文に拡張し、これらの文が誤りであることを知っているにもかかわらず、ユーザーが同意する場合には言語モデルもそれに同意することを発見しました。シンコファンシーを軽減するために、公開されているNLPタスクを利用し、モデルがこれらのタスクに対するユーザーの意見に対して頑健であることを促す、シンプルな合成データ介入を提示します。これらのデータを軽微なファインチューニングステップに追加することで、保留されたプロンプトにおけるシンコファンシックな振る舞いを大幅に削減できます。介入用の合成データを生成するコードは、https://github.com/google/sycophancy-intervention で公開されています。

English

Sycophancy is an undesirable behavior where models tailor their responses to follow a human user's view even when that view is not objectively correct (e.g., adapting liberal views once a user reveals that they are liberal). In this paper, we study the prevalence of sycophancy in language models and propose a simple synthetic-data intervention to reduce this behavior. First, on a set of three sycophancy tasks (Perez et al., 2022) where models are asked for an opinion on statements with no correct answers (e.g., politics), we observe that both model scaling and instruction tuning significantly increase sycophancy for PaLM models up to 540B parameters. Second, we extend sycophancy evaluations to simple addition statements that are objectively incorrect, finding that despite knowing that these statements are wrong, language models will still agree with them if the user does as well. To reduce sycophancy, we present a straightforward synthetic-data intervention that takes public NLP tasks and encourages models to be robust to user opinions on these tasks. Adding these data in a lightweight finetuning step can significantly reduce sycophantic behavior on held-out prompts. Code for generating synthetic data for intervention can be found at https://github.com/google/sycophancy-intervention.

単純な合成データは大規模言語モデルにおけるご機嫌取りを軽減する

Simple synthetic data reduces sycophancy in large language models

要旨

Support