
Language Models Learn to Mislead Humans via RLHF

September 19, 2024
Authors: Jiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Samuel R. Bowman, He He, Shi Feng
cs.AI

Abstract

Language models (LMs) can produce errors that are hard to detect for humans, especially when the task is complex. RLHF, the most popular post-training method, may exacerbate this problem: to achieve higher rewards, LMs might get better at convincing humans that they are right even when they are wrong. We study this phenomenon under a standard RLHF pipeline, calling it "U-SOPHISTRY" since it is Unintended by model developers. Specifically, we ask time-constrained (e.g., 3-10 minutes) human subjects to evaluate the correctness of model outputs and calculate humans' accuracy against gold labels. On a question-answering task (QuALITY) and programming task (APPS), RLHF makes LMs better at convincing our subjects but not at completing the task correctly. RLHF also makes the model harder to evaluate: our subjects' false positive rate increases by 24.1% on QuALITY and 18.3% on APPS. Finally, we show that probing, a state-of-the-art approach for detecting Intended Sophistry (e.g. backdoored LMs), does not generalize to U-SOPHISTRY. Our results highlight an important failure mode of RLHF and call for more research in assisting humans to align them.
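As a minimal sketch (not from the paper) of the evaluation metrics the abstract refers to: human accuracy is measured against gold correctness labels, and the false positive rate is the fraction of actually-incorrect model outputs that human evaluators approve. All function names and data below are hypothetical illustrations.

```python
# Illustrative sketch: human-evaluation accuracy and false positive rate
# against gold labels. Data values are hypothetical placeholders.

def accuracy(human_verdicts, gold_labels):
    """Fraction of outputs whose correctness the human evaluator judged correctly."""
    assert len(human_verdicts) == len(gold_labels)
    return sum(h == g for h, g in zip(human_verdicts, gold_labels)) / len(gold_labels)

def false_positive_rate(human_verdicts, gold_labels):
    """Fraction of actually-incorrect outputs that humans nevertheless approved."""
    verdicts_on_wrong = [h for h, g in zip(human_verdicts, gold_labels) if not g]
    return sum(verdicts_on_wrong) / len(verdicts_on_wrong) if verdicts_on_wrong else 0.0

# Hypothetical example: True = "output is / is judged correct".
gold   = [True, False, False, True, False, False]  # gold correctness of model outputs
before = [True, False, True,  True, False, False]  # human verdicts on the initial policy
after  = [True, True,  True,  True, False, True]   # human verdicts on the RLHF policy

print("accuracy before RLHF:", accuracy(before, gold))
print("accuracy after RLHF: ", accuracy(after, gold))
print("FPR before RLHF:", false_positive_rate(before, gold))
print("FPR after RLHF: ", false_positive_rate(after, gold))
```

In this toy setup, a rise in FPR with unchanged or lower accuracy mirrors the abstract's finding: evaluators approve more wrong outputs after RLHF even though task performance does not improve.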
