
Language Model Council: Benchmarking Foundation Models on Highly Subjective Tasks by Consensus

June 12, 2024
Authors: Justin Zhao, Flor Miriam Plaza-del-Arco, Amanda Cercas Curry
cs.AI

Abstract

The rapid advancement of Large Language Models (LLMs) necessitates robust and challenging benchmarks. Leaderboards like Chatbot Arena rank LLMs based on how well their responses align with human preferences. However, many tasks, such as those related to emotional intelligence, creative writing, or persuasiveness, are highly subjective and often lack majoritarian human agreement. Judges may have irreconcilable disagreements about what constitutes a better response. To address the challenge of ranking LLMs on highly subjective tasks, we propose a novel benchmarking framework, the Language Model Council (LMC). The LMC operates through a democratic process to: 1) formulate a test set through equal participation, 2) administer the test among council members, and 3) evaluate responses as a collective jury. We deploy a council of 20 of the newest LLMs on an open-ended emotional intelligence task: responding to interpersonal dilemmas. Our results show that the LMC produces rankings that are more separable, robust, and less biased than those from any individual LLM judge, and more consistent with a human-established leaderboard than other benchmarks.
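
The three-stage democratic process the abstract outlines (equal-participation test set, cross-administration, collective jury) can be summarized in a short Python sketch. The version below is hypothetical: the `generate` and `judge` callables, the prompt text, and the simple win-count aggregation are placeholder assumptions for illustration, not the paper's actual protocol.

```python
# A minimal sketch of the LMC flow described above, assuming hypothetical
# `generate` and `judge` callables and a simple win-count aggregation.
# Illustrative only; not the authors' implementation.
from collections import defaultdict

def run_council(members, generate, judge):
    """members: list of model identifiers.
    generate(model, prompt) -> str        # model's free-form response
    judge(model, dilemma, a, b) -> str    # 'a' or 'b': which response is better
    """
    # 1) Formulate the test set through equal participation:
    #    every council member contributes one interpersonal dilemma.
    dilemmas = [generate(m, "Write one interpersonal dilemma.") for m in members]

    # 2) Administer the test among council members:
    #    every member responds to every dilemma.
    responses = {(m, d): generate(m, d) for m in members for d in dilemmas}

    # 3) Evaluate responses as a collective jury:
    #    each member judges pairwise match-ups it did not author; wins are tallied.
    wins = defaultdict(int)
    for d in dilemmas:
        for i, a in enumerate(members):
            for b in members[i + 1:]:
                for juror in members:
                    if juror in (a, b):
                        continue  # a member never judges its own response
                    verdict = judge(juror, d, responses[(a, d)], responses[(b, d)])
                    wins[a if verdict == "a" else b] += 1

    # Rank the council by total jury wins (most wins first).
    return sorted(members, key=lambda m: -wins[m])
```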