

On scalable oversight with weak LLMs judging strong LLMs

July 5, 2024
Authors: Zachary Kenton, Noah Y. Siegel, János Kramár, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman, Rohin Shah
cs.AI

Abstract

Scalable oversight protocols aim to enable humans to accurately supervise superhuman AI. In this paper we study debate, where two AIs compete to convince a judge; consultancy, where a single AI tries to convince a judge that asks questions; and compare to a baseline of direct question answering, where the judge just answers outright without the AI. We use large language models (LLMs) as both AI agents and as stand-ins for human judges, taking the judge models to be weaker than agent models. We benchmark on a diverse range of asymmetries between judges and agents, extending previous work on a single extractive QA task with information asymmetry, to also include mathematics, coding, logic and multimodal reasoning asymmetries. We find that debate outperforms consultancy across all tasks when the consultant is randomly assigned to argue for the correct/incorrect answer. Comparing debate to direct question answering, the results depend on the type of task: in extractive QA tasks with information asymmetry debate outperforms direct question answering, but in other tasks without information asymmetry the results are mixed. Previous work assigned debaters/consultants an answer to argue for. When we allow them to instead choose which answer to argue for, we find judges are less frequently convinced by the wrong answer in debate than in consultancy. Further, we find that stronger debater models increase judge accuracy, though more modestly than in previous studies.
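
As a rough illustration of the three protocols compared in the abstract, the following is a minimal Python sketch. The `Model` callable, the prompt wording, and the number of rounds are assumptions made here for illustration, not the paper's actual prompts or experimental setup.

```python
from typing import Callable, List

# Hypothetical interface: a model maps a text prompt to a text reply.
Model = Callable[[str], str]


def debate(question: str, answers: List[str], debater: Model, judge: Model,
           rounds: int = 3) -> str:
    """Two copies of the stronger model argue for opposing answers;
    the weaker judge reads the transcript and picks an answer."""
    transcript = f"Question: {question}\n"
    for _ in range(rounds):
        for side, ans in enumerate(answers):
            prompt = (f"{transcript}\nYou are Debater {side + 1}. "
                      f"Argue that the answer is: {ans}")
            transcript += f"Debater {side + 1}: {debater(prompt)}\n"
    return judge(f"{transcript}\nWhich answer is correct? "
                 f"Reply with one of: {answers}")


def consultancy(question: str, assigned_answer: str, consultant: Model,
                judge: Model, rounds: int = 3) -> str:
    """A single stronger model argues for its assigned answer (correct or
    incorrect); the judge asks follow-up questions before deciding."""
    transcript = f"Question: {question}\n"
    for _ in range(rounds):
        argument = consultant(f"{transcript}\nArgue that the answer is: "
                              f"{assigned_answer}")
        transcript += f"Consultant: {argument}\n"
        follow_up = judge(f"{transcript}\nAsk one clarifying question.")
        transcript += f"Judge: {follow_up}\n"
    return judge(f"{transcript}\nWhat is the final answer?")


def direct_qa(question: str, judge: Model) -> str:
    """Baseline: the weaker judge answers with no AI assistance."""
    return judge(f"Question: {question}\nAnswer:")
```

Any prompt-to-text callable can be passed as `debater`, `consultant`, or `judge`, so a weaker judge model and stronger agent models can be swapped in to mirror the asymmetric setup the paper studies.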