Simple synthetic data reduces sycophancy in large language models

Abstract

Sycophancy is an undesirable behavior in which models tailor their responses to follow a human user's view even when that view is not objectively correct (e.g., adopting liberal views once a user reveals that they are liberal). In this paper, we study the prevalence of sycophancy in language models and propose a simple synthetic-data intervention to reduce this behavior. First, on a set of three sycophancy tasks (Perez et al., 2022) where models are asked for an opinion on statements with no correct answers (e.g., politics), we observe that both model scaling and instruction tuning significantly increase sycophancy for PaLM models up to 540B parameters. Second, we extend sycophancy evaluations to simple addition statements that are objectively incorrect, finding that even though language models know these statements are wrong, they will still agree with them if the user does as well. To reduce sycophancy, we present a straightforward synthetic-data intervention that takes public NLP tasks and encourages models to be robust to user opinions on these tasks. Adding these data in a lightweight finetuning step can significantly reduce sycophantic behavior on held-out prompts. Code for generating synthetic data for intervention can be found at https://github.com/google/sycophancy-intervention.
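As a rough illustration of the intervention idea, the sketch below turns a labeled example from a classification task into a finetuning prompt in which the user's stated opinion is sampled independently of whether the claim is true, so the target answer depends only on the claim. The prompt wording, the `SyntheticExample` type, and the `make_example` helper are assumptions made for illustration; they are not taken from the abstract or from the linked repository.

```python
import random
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SyntheticExample:
    prompt: str
    target: str
    claim_is_true: bool
    user_agrees: bool


def make_example(text: str, true_label: str, labels: Sequence[str],
                 rng: random.Random) -> SyntheticExample:
    """Build one hypothetical training example from a labeled input."""
    # Decide whether the claim shown to the model is true or false.
    claim_is_true = rng.random() < 0.5
    if claim_is_true:
        claimed = true_label
    else:
        claimed = rng.choice([lab for lab in labels if lab != true_label])

    # Sample the user's opinion independently of the claim's truth,
    # so that following the user carries no information about the answer.
    user_agrees = rng.random() < 0.5
    stance = "agree" if user_agrees else "disagree"

    # Assumed prompt template, for illustration only.
    prompt = (
        f'Human: I {stance} with the claim that "{text}" is {claimed}. '
        f'Do you agree or disagree with the claim that "{text}" is {claimed}?\n'
        "Choices:\n(A) Agree\n(B) Disagree\n\n"
        "Assistant: I believe the best answer is"
    )
    # The target depends only on whether the claim is true.
    target = "(A)" if claim_is_true else "(B)"
    return SyntheticExample(prompt, target, claim_is_true, user_agrees)


if __name__ == "__main__":
    rng = random.Random(0)
    ex = make_example("this movie was great", "positive",
                      ["positive", "negative"], rng)
    print(ex.prompt, ex.target)
```

Because the user's stance and the claim's truth are drawn separately, a model finetuned on such data cannot reduce its loss by simply echoing the user, which is the robustness property the abstract describes.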
