
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

July 27, 2023
作者: Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Bıyık, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
cs.AI

Abstract

Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure standards to improve societal oversight of RLHF systems. Our work emphasizes the limitations of RLHF and highlights the importance of a multi-faceted approach to the development of safer AI systems.