
Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation

June 5, 2025
Authors: Noy Sternlicht, Ariel Gera, Roy Bar-Haim, Tom Hope, Noam Slonim
cs.AI

Abstract

We introduce Debate Speech Evaluation as a novel and challenging benchmark for assessing LLM judges. Evaluating debate speeches requires a deep understanding of the speech at multiple levels, including argument strength and relevance, the coherence and organization of the speech, the appropriateness of its style and tone, and so on. This task involves a unique set of cognitive abilities that have previously received limited attention in systematic LLM benchmarking. To explore such skills, we leverage a dataset of over 600 meticulously annotated debate speeches and present the first in-depth analysis of how state-of-the-art LLMs compare to human judges on this task. Our findings reveal a nuanced picture: while larger models can approximate individual human judgments in some respects, they differ substantially in their overall judgment behavior. We also investigate the ability of frontier LLMs to generate persuasive, opinionated speeches, showing that models may perform at a human level on this task.
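To make the evaluation setup concrete, below is a minimal sketch (not the authors' released code) of how one might query an LLM judge for an overall speech score and correlate its ratings with human annotations. The prompt wording, the 1-5 scale, and the `call_llm` helper are assumptions for illustration only.

```python
# Hypothetical sketch of LLM-as-judge scoring for debate speeches and
# agreement analysis against human annotators. Prompt, scale, and the
# `call_llm` text-in/text-out helper are assumptions, not the paper's setup.

from statistics import mean

import numpy as np
from scipy.stats import pearsonr, spearmanr

JUDGE_PROMPT = (
    "You are judging a competitive debate speech on the motion: {motion}\n\n"
    "Speech:\n{speech}\n\n"
    "Rate the overall quality of the speech on a 1-5 scale, considering "
    "argument strength and relevance, coherence and organization, and the "
    "appropriateness of its style and tone. Reply with a single number."
)


def judge_speech(call_llm, motion: str, speech: str) -> float:
    """Ask an LLM judge for an overall 1-5 score; `call_llm` is any text->text API."""
    reply = call_llm(JUDGE_PROMPT.format(motion=motion, speech=speech))
    return float(reply.strip())


def compare_to_humans(llm_scores: list[float], human_scores: list[list[float]]) -> dict:
    """Correlate LLM scores with the mean human score for each speech."""
    human_means = np.array([mean(h) for h in human_scores])
    llm = np.array(llm_scores)
    return {
        "pearson": pearsonr(llm, human_means)[0],
        "spearman": spearmanr(llm, human_means)[0],
    }
```

Per-speech correlation of this kind captures agreement with individual scores; as the abstract notes, it does not by itself reveal differences in overall judgment behavior (e.g., systematic leniency or score distribution shifts), which call for complementary analyses.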