Drawing Conclusions from Draws: Rethinking Preference Semantics in Arena-Style LLM Evaluation
October 2, 2025
Authors: Raphael Tang, Crystina Zhang, Wenyan Li, Carmen Lai, Pontus Stenetorp, Yao Lu
cs.AI
Abstract
In arena-style evaluation of large language models (LLMs), two LLMs respond
to a user query, and the user chooses the winning response or deems the
"battle" a draw, resulting in an adjustment to the ratings of both models. The
prevailing approach for modeling these rating dynamics is to view battles as
two-player game matches, as in chess, and apply the Elo rating system and its
derivatives. In this paper, we critically examine this paradigm. Specifically,
we question whether a draw genuinely means that the two models are equal and
hence whether their ratings should be equalized. Instead, we conjecture that
draws are more indicative of query difficulty: if the query is too easy, then
both models are more likely to succeed equally. On three real-world arena
datasets, we show that ignoring rating updates for draws yields a 1-3% relative
increase in battle outcome prediction accuracy (which includes draws) for all
four rating systems studied. Further analyses suggest that draws occur more
often for queries rated as very easy and for those rated as highly objective,
with risk ratios of 1.37 and 1.35, respectively. We recommend that future
rating systems reconsider existing draw semantics and account for query
properties in rating updates.
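
To make the intervention concrete, below is a minimal Python sketch, not the authors' code, of the standard Elo update for a single battle, with an optional flag that skips the update on a draw rather than pulling the two ratings toward each other. The function names, the K-factor of 32, and the example ratings are illustrative assumptions.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def elo_update(r_a: float, r_b: float, outcome: str,
               k: float = 32.0, skip_draws: bool = False) -> tuple[float, float]:
    """Return updated ratings after one battle.

    outcome: "a" if model A wins, "b" if model B wins, "draw" otherwise.
    With skip_draws=True, a draw leaves both ratings unchanged instead of
    moving them toward each other, as the paper proposes.
    """
    if outcome == "draw" and skip_draws:
        return r_a, r_b
    s_a = {"a": 1.0, "b": 0.0, "draw": 0.5}[outcome]  # actual score for A
    e_a = expected_score(r_a, r_b)                    # expected score for A
    new_a = r_a + k * (s_a - e_a)
    new_b = r_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return new_a, new_b


# Example: a draw between unevenly rated models.
print(elo_update(1100, 1000, "draw"))                   # ratings converge
print(elo_update(1100, 1000, "draw", skip_draws=True))  # (1100, 1000)
```

In this example, a draw between a 1100-rated and a 1000-rated model moves the standard Elo ratings toward each other (to roughly 1095.5 and 1004.5), while the draw-skipping variant leaves them unchanged, consistent with the paper's conjecture that a draw need not signal equal model strength.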