On scalable oversight with weak LLMs judging strong LLMs
July 5, 2024
Authors: Zachary Kenton, Noah Y. Siegel, János Kramár, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman, Rohin Shah
cs.AI
Abstract
Scalable oversight protocols aim to enable humans to accurately supervise
superhuman AI. In this paper we study debate, where two AIs compete to
convince a judge; consultancy, where a single AI tries to convince a judge that
asks questions; and compare to a baseline of direct question-answering, where
the judge just answers outright without the AI. We use large language models
(LLMs) as both AI agents and as stand-ins for human judges, taking the judge
models to be weaker than agent models. We benchmark on a diverse range of
asymmetries between judges and agents, extending previous work on a single
extractive QA task with information asymmetry, to also include mathematics,
coding, logic and multimodal reasoning asymmetries. We find that debate
outperforms consultancy across all tasks when the consultant is randomly
assigned to argue for the correct/incorrect answer. Comparing debate to direct
question answering, the results depend on the type of task: in extractive QA
tasks with information asymmetry debate outperforms direct question answering,
but in other tasks without information asymmetry the results are mixed.
Previous work assigned debaters/consultants an answer to argue for. When we
allow them to instead choose which answer to argue for, we find judges are less
frequently convinced by the wrong answer in debate than in consultancy.
Further, we find that stronger debater models increase judge accuracy, though
more modestly than in previous studies.
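For readers who want a concrete picture of the three protocols compared in the abstract, below is a minimal Python sketch of debate, consultancy, and direct question answering written as generic LLM interaction loops. This is not the authors' code: `query_llm` is a hypothetical stand-in for any chat-completion API, and the model names and turn counts are illustrative assumptions rather than details from the paper.

```python
# Minimal sketch (not the authors' implementation) of the three oversight
# protocols compared in the abstract: debate, consultancy, and direct QA.
# `query_llm`, the model names, and the turn counts are illustrative
# assumptions, not details taken from the paper.

def query_llm(model: str, prompt: str) -> str:
    """Stand-in for a chat-completion API call; returns a canned reply here."""
    return f"[{model} reply to: {prompt[:40]}...]"


def debate(question: str, answers: tuple[str, str],
           debater_model: str, judge_model: str, turns: int = 3) -> str:
    """Two debaters argue for opposing answers; a weaker judge then decides."""
    transcript: list[str] = []
    for _ in range(turns):
        for side, answer in enumerate(answers):
            argument = query_llm(
                debater_model,
                f"Question: {question}\nArgue that the answer is: {answer}\n"
                "Transcript so far:\n" + "\n".join(transcript),
            )
            transcript.append(f"Debater {side + 1} (for '{answer}'): {argument}")
    return query_llm(
        judge_model,
        f"Question: {question}\nDebate transcript:\n" + "\n".join(transcript)
        + f"\nWhich answer is correct: '{answers[0]}' or '{answers[1]}'?",
    )


def consultancy(question: str, assigned_answer: str,
                consultant_model: str, judge_model: str, turns: int = 3) -> str:
    """One consultant argues for a (possibly incorrect) assigned answer;
    the judge asks follow-up questions before giving a verdict."""
    transcript: list[str] = []
    for _ in range(turns):
        pitch = query_llm(
            consultant_model,
            f"Question: {question}\nArgue that the answer is: {assigned_answer}\n"
            "Dialogue so far:\n" + "\n".join(transcript),
        )
        transcript.append(f"Consultant: {pitch}")
        follow_up = query_llm(
            judge_model,
            f"Question: {question}\nDialogue so far:\n" + "\n".join(transcript)
            + "\nAsk one clarifying question.",
        )
        transcript.append(f"Judge: {follow_up}")
    return query_llm(
        judge_model,
        f"Question: {question}\nDialogue:\n" + "\n".join(transcript)
        + "\nWhat is the correct answer?",
    )


def direct_qa(question: str, judge_model: str) -> str:
    """Baseline: the judge answers outright, with no AI assistance."""
    return query_llm(judge_model, f"Question: {question}\nAnswer directly.")


if __name__ == "__main__":
    q = "Which of the two candidate answers is supported by the hidden passage?"
    print(debate(q, ("Answer A", "Answer B"), "strong-debater", "weak-judge"))
    print(consultancy(q, "Answer A", "strong-consultant", "weak-judge"))
    print(direct_qa(q, "weak-judge"))
```

The structural difference the paper studies is visible in the sketch: in debate the judge sees arguments for both candidate answers, in consultancy it hears only one side and must probe it with questions, and in direct QA it receives no assistance at all.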