
Language Model Council: Benchmarking Foundation Models on Highly Subjective Tasks by Consensus

June 12, 2024
Authors: Justin Zhao, Flor Miriam Plaza-del-Arco, Amanda Cercas Curry
cs.AI

Abstract

The rapid advancement of Large Language Models (LLMs) necessitates robust and challenging benchmarks. Leaderboards like Chatbot Arena rank LLMs based on how well their responses align with human preferences. However, many tasks, such as those related to emotional intelligence, creative writing, or persuasiveness, are highly subjective and often lack majoritarian human agreement. Judges may have irreconcilable disagreements about what constitutes a better response. To address the challenge of ranking LLMs on highly subjective tasks, we propose a novel benchmarking framework, the Language Model Council (LMC). The LMC operates through a democratic process to: 1) formulate a test set through equal participation, 2) administer the test among council members, and 3) evaluate responses as a collective jury. We deploy a council of 20 of the newest LLMs on an open-ended emotional intelligence task: responding to interpersonal dilemmas. Our results show that the LMC produces rankings that are more separable, robust, and less biased than those from any individual LLM judge, and more consistent with a human-established leaderboard than other benchmarks.
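
The three-stage council protocol can be pictured as a simple aggregation loop. The sketch below is illustrative only, assuming a hypothetical `query_model(model_name, prompt)` helper; the prompt wording and the pairwise majority-vote aggregation are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of the Language Model Council's three-stage process.
# All function names, prompts, and the aggregation scheme are placeholders,
# not the authors' actual implementation or API.
from collections import defaultdict
from typing import Callable, Dict, List

def run_council(
    members: List[str],
    query_model: Callable[[str, str], str],  # (model_name, prompt) -> text
) -> Dict[str, float]:
    # 1) Formulate the test set: every member contributes prompts equally.
    test_set = [query_model(m, "Write one interpersonal dilemma.") for m in members]

    # 2) Administer the test: every member responds to every prompt.
    responses = {
        (m, i): query_model(m, dilemma)
        for m in members
        for i, dilemma in enumerate(test_set)
    }

    # 3) Evaluate as a collective jury: each member judges pairwise matchups,
    #    and per-judge verdicts are pooled into an aggregate win rate.
    wins, totals = defaultdict(int), defaultdict(int)
    for i, dilemma in enumerate(test_set):
        for a in members:
            for b in members:
                if a >= b:  # compare each unordered pair once
                    continue
                for judge in members:
                    prompt = (
                        f"Dilemma: {dilemma}\n"
                        f"Response A: {responses[(a, i)]}\n"
                        f"Response B: {responses[(b, i)]}\n"
                        "Which response is better, A or B?"
                    )
                    verdict = query_model(judge, prompt).strip().upper()
                    winner = a if verdict.startswith("A") else b
                    wins[winner] += 1
                    totals[a] += 1
                    totals[b] += 1

    # Consensus ranking: sort members by win rate aggregated across all judges.
    return {m: wins[m] / max(totals[m], 1) for m in members}
```

The raw win rate here could be swapped for a Borda-style or Elo aggregation; the essential property of the council design is that every member acts as both examinee and juror.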
