

Drawing Conclusions from Draws: Rethinking Preference Semantics in Arena-Style LLM Evaluation

October 2, 2025
Authors: Raphael Tang, Crystina Zhang, Wenyan Li, Carmen Lai, Pontus Stenetorp, Yao Lu
cs.AI

Abstract

In arena-style evaluation of large language models (LLMs), two LLMs respond to a user query, and the user chooses the winning response or deems the "battle" a draw, resulting in an adjustment to the ratings of both models. The prevailing approach for modeling these rating dynamics is to view battles as two-player game matches, as in chess, and to apply the Elo rating system and its derivatives. In this paper, we critically examine this paradigm. Specifically, we question whether a draw genuinely means that the two models are equal and hence whether their ratings should be equalized. Instead, we conjecture that draws are more indicative of query difficulty: if the query is too easy, then both models are more likely to succeed equally. On three real-world arena datasets, we show that ignoring rating updates for draws yields a 1-3% relative increase in battle outcome prediction accuracy (which includes draws) for all four rating systems studied. Further analyses suggest that draws occur more often for queries rated as very easy and as highly objective, with risk ratios of 1.37 and 1.35, respectively. We recommend that future rating systems reconsider existing draw semantics and account for query properties in rating updates.
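To make the proposed change concrete, here is a minimal sketch of a standard Elo update with an optional switch that skips the update on draws, as the abstract describes. This is an illustration of the general idea, not the authors' implementation; the function name, the K-factor of 32, and the scale constant of 400 are common Elo conventions assumed here, not taken from the paper.

```python
def elo_update(r_a, r_b, outcome, k=32, ignore_draws=True):
    """One Elo rating update for a battle between models A and B.

    outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a draw.
    With ignore_draws=True (the variant studied in the paper),
    a draw leaves both ratings unchanged instead of pulling them together.
    """
    if outcome == 0.5 and ignore_draws:
        return r_a, r_b  # draw carries no rating signal under this variant
    # Expected score of A under the standard logistic Elo model.
    e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (outcome - e_a)
    return r_a + delta, r_b - delta
```

For equally rated models, a win moves 16 points (with k=32) from loser to winner, while a draw leaves both ratings at their current values; under the classic treatment (ignore_draws=False), a draw instead pulls the higher-rated model's score down toward its opponent's.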