Drawing Conclusions from Draws: Rethinking Preference Semantics in Arena-Style LLM Evaluation
October 2, 2025
Authors: Raphael Tang, Crystina Zhang, Wenyan Li, Carmen Lai, Pontus Stenetorp, Yao Lu
cs.AI
Abstract
In arena-style evaluation of large language models (LLMs), two LLMs respond
to a user query, and the user chooses the winning response or deems the
"battle" a draw, resulting in an adjustment to the ratings of both models. The
prevailing approach for modeling these rating dynamics is to view battles as
two-player game matches, as in chess, and apply the Elo rating system and its
derivatives. In this paper, we critically examine this paradigm. Specifically,
we question whether a draw genuinely means that the two models are equal and
hence whether their ratings should be equalized. Instead, we conjecture that
draws are more indicative of query difficulty: if the query is too easy, then
both models are more likely to succeed equally. On three real-world arena
datasets, we show that ignoring rating updates for draws yields a 1-3% relative
increase in battle outcome prediction accuracy (which includes draws) for all
four rating systems studied. Further analyses suggest that draws occur more
often for queries rated as very easy and for those rated as highly objective,
with risk ratios of 1.37 and 1.35, respectively. We recommend that future
rating systems reconsider existing draw semantics and account for query
properties in rating updates.
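
To make the intervention concrete, below is a minimal Python sketch, not the authors' code, of the standard Elo update for a single battle, with an optional flag that skips the update on a draw rather than pulling the two ratings toward each other. The function names, the K-factor of 32, and the example ratings are illustrative assumptions.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def elo_update(r_a: float, r_b: float, outcome: str,
               k: float = 32.0, skip_draws: bool = False) -> tuple[float, float]:
    """Return updated ratings after one battle.

    outcome: "a" if model A wins, "b" if model B wins, "draw" otherwise.
    With skip_draws=True, a draw leaves both ratings unchanged instead of
    moving them toward each other, as the paper proposes.
    """
    if outcome == "draw" and skip_draws:
        return r_a, r_b
    s_a = {"a": 1.0, "b": 0.0, "draw": 0.5}[outcome]  # actual score for A
    e_a = expected_score(r_a, r_b)                    # expected score for A
    new_a = r_a + k * (s_a - e_a)
    new_b = r_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return new_a, new_b


# Example: a draw between unevenly rated models.
print(elo_update(1100, 1000, "draw"))                   # ratings converge
print(elo_update(1100, 1000, "draw", skip_draws=True))  # (1100, 1000)
```

In this example, a draw between a 1100-rated and a 1000-rated model moves the standard Elo ratings toward each other (to roughly 1095.5 and 1004.5), while the draw-skipping variant leaves them unchanged, consistent with the paper's conjecture that a draw need not signal equal model strength.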