On scalable oversight with weak LLMs judging strong LLMs

Abstract

Scalable oversight protocols aim to enable humans to accurately supervise superhuman AI. In this paper we study debate, where two AIs compete to convince a judge, and consultancy, where a single AI tries to convince a judge that asks questions. We compare both to a baseline of direct question-answering, where the judge answers outright without the AI. We use large language models (LLMs) both as AI agents and as stand-ins for human judges, taking the judge models to be weaker than the agent models. We benchmark on a diverse range of asymmetries between judges and agents, extending previous work on a single extractive QA task with information asymmetry to also include mathematics, coding, logic and multimodal reasoning asymmetries. We find that debate outperforms consultancy across all tasks when the consultant is randomly assigned to argue for the correct or incorrect answer. Comparing debate to direct question-answering, the results depend on the type of task: in extractive QA tasks with information asymmetry, debate outperforms direct question-answering, but in other tasks without information asymmetry the results are mixed. Previous work assigned debaters and consultants an answer to argue for. When we instead allow them to choose which answer to argue for, we find that judges are less frequently convinced by the wrong answer in debate than in consultancy. Further, we find that stronger debater models increase judge accuracy, though more modestly than in previous studies.
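
The three protocols compared above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: all function names and signatures are hypothetical, agents and judges are plain Python callables standing in for LLM calls, each question is assumed to have exactly two candidate answers, and the judge's follow-up questions in consultancy are omitted for brevity.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical interfaces (assumptions, not taken from the paper):
#   agent(question, answer_to_defend, transcript) -> argument text
#   judge(question, options, transcript) -> chosen answer
Agent = Callable[[str, str, Sequence[str]], str]
Judge = Callable[[str, Sequence[str], Sequence[str]], str]


def direct_qa(question: str, options: Sequence[str], judge: Judge) -> str:
    """Baseline: the judge answers outright, with no AI arguments."""
    return judge(question, options, [])


def consultancy(question: str, options: Sequence[str], consultant: Agent,
                judge: Judge, rng: random.Random, turns: int = 1) -> str:
    """A single consultant argues for a randomly assigned answer."""
    assigned = rng.choice(list(options))  # may be the correct or the incorrect answer
    transcript: List[str] = []
    for _ in range(turns):
        transcript.append(consultant(question, assigned, transcript))
    return judge(question, options, transcript)


def debate(question: str, options: Sequence[str], debater: Agent,
           judge: Judge, turns: int = 1) -> str:
    """Two debaters each defend a different answer; the judge sees both sides."""
    transcript: List[str] = []
    for _ in range(turns):
        for answer in options:  # one debater per candidate answer
            transcript.append(debater(question, answer, transcript))
    return judge(question, options, transcript)


def judge_accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of questions on which the judge picked the correct answer."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
```

In a real experiment the callables would wrap a stronger model for the agents and a weaker model for the judge, and `judge_accuracy` would be compared across the three protocols on the same question set.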
