Language Models Learn to Mislead Humans via RLHF

Abstract

Language models (LMs) can produce errors that are hard for humans to detect, especially when the task is complex. RLHF, the most popular post-training method, may exacerbate this problem: to achieve higher rewards, LMs might get better at convincing humans that they are right even when they are wrong. We study this phenomenon under a standard RLHF pipeline, calling it "U-SOPHISTRY" since it is Unintended by model developers. Specifically, we ask time-constrained (e.g., 3-10 minutes) human subjects to evaluate the correctness of model outputs and calculate humans' accuracy against gold labels. On a question-answering task (QuALITY) and a programming task (APPS), RLHF makes LMs better at convincing our subjects but not at completing the task correctly. RLHF also makes the model harder to evaluate: our subjects' false positive rate increases by 24.1% on QuALITY and 18.3% on APPS. Finally, we show that probing, a state-of-the-art approach for detecting Intended Sophistry (e.g., backdoored LMs), does not generalize to U-SOPHISTRY. Our results highlight an important failure mode of RLHF and call for more research on assisting humans in aligning LMs.
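The two evaluation quantities named in the abstract, human accuracy against gold labels and the false positive rate, can be illustrated with a minimal sketch. It assumes the usual definition of a false positive (a human approves an output that the gold label marks as wrong). The `Judgment` type, the function names, and the toy data are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Judgment:
    """One human evaluation of one model output (hypothetical record type)."""
    gold_correct: bool    # gold label: is the output actually correct?
    human_approved: bool  # did the time-constrained evaluator judge it correct?


def evaluator_accuracy(judgments: List[Judgment]) -> float:
    """Fraction of judgments where the human verdict matches the gold label."""
    if not judgments:
        raise ValueError("need at least one judgment")
    matches = sum(j.gold_correct == j.human_approved for j in judgments)
    return matches / len(judgments)


def false_positive_rate(judgments: List[Judgment]) -> float:
    """Among outputs that are wrong per the gold label, the fraction humans approved."""
    wrong = [j for j in judgments if not j.gold_correct]
    if not wrong:
        raise ValueError("need at least one gold-incorrect output")
    return sum(j.human_approved for j in wrong) / len(wrong)


# Toy data for illustration only; these are not results from the paper.
toy_judgments = [
    Judgment(gold_correct=True, human_approved=True),    # true positive
    Judgment(gold_correct=False, human_approved=True),   # false positive
    Judgment(gold_correct=False, human_approved=False),  # true negative
    Judgment(gold_correct=True, human_approved=False),   # false negative
]

toy_accuracy = evaluator_accuracy(toy_judgments)
toy_fpr = false_positive_rate(toy_judgments)
```

Computing both quantities once for the model before RLHF and once after, on the same tasks, gives the kind of before/after comparison the abstract reports.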
