
Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation

June 5, 2025
作者: Noy Sternlicht, Ariel Gera, Roy Bar-Haim, Tom Hope, Noam Slonim
cs.AI

Abstract

We introduce Debate Speech Evaluation as a novel and challenging benchmark for assessing LLM judges. Evaluating debate speeches requires a deep understanding of the speech at multiple levels, including argument strength and relevance, the coherence and organization of the speech, the appropriateness of its style and tone, and so on. This task involves a unique set of cognitive abilities that have previously received limited attention in systematic LLM benchmarking. To explore such skills, we leverage a dataset of over 600 meticulously annotated debate speeches and present the first in-depth analysis of how state-of-the-art LLMs compare to human judges on this task. Our findings reveal a nuanced picture: while larger models can approximate individual human judgments in some respects, they differ substantially in their overall judgment behavior. We also investigate the ability of frontier LLMs to generate persuasive, opinionated speeches, showing that models may perform at a human level on this task.
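The abstract does not spell out the judging protocol, but the LLM-as-a-judge setup it benchmarks can be illustrated with a minimal sketch. The rubric wording, the 1-10 scale, the `score_speech` helper, and the choice of the OpenAI Python client as a backend are all assumptions made for illustration; they are not the authors' implementation.

```python
# Minimal illustrative sketch of an LLM judge scoring a debate speech.
# The rubric, scale, and prompt wording are assumptions for illustration only,
# not the protocol used in the paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "You are judging a competitive debate speech on the given motion. "
    "Consider argument strength and relevance, coherence and organization, "
    "and appropriateness of style and tone. "
    "Reply with a single overall score from 1 (poor) to 10 (excellent)."
)

def score_speech(motion: str, speech: str, model: str = "gpt-4o") -> int:
    """Ask the model for one overall quality score for a single speech."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Motion: {motion}\n\nSpeech:\n{speech}"},
        ],
    )
    return int(response.choices[0].message.content.strip())
```

In a study like the one described, scores produced this way would be compared against the human annotations (for example, via correlation or agreement statistics) to measure how closely the LLM judge tracks individual human judges.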