Towards Understanding Sycophancy in Language Models

Abstract

Reinforcement learning from human feedback (RLHF) is a popular technique for training high-quality AI assistants. However, RLHF may also encourage model responses that match user beliefs over truthful responses, a behavior known as sycophancy. We investigate the prevalence of sycophancy in RLHF-trained models and whether human preference judgements are responsible. We first demonstrate that five state-of-the-art AI assistants consistently exhibit sycophantic behavior across four varied free-form text-generation tasks. To understand whether human preferences drive this broadly observed behavior of RLHF models, we analyze existing human preference data. We find that when a response matches a user's views, it is more likely to be preferred. Moreover, both humans and preference models (PMs) prefer convincingly written sycophantic responses over correct ones a non-negligible fraction of the time. Optimizing model outputs against PMs also sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results indicate that sycophancy is a general behavior of RLHF models, likely driven in part by human preference judgements favoring sycophantic responses.

