The Leaderboard Illusion
April 29, 2025
Authors: Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D'Souza, Sayash Kapoor, Ahmet Üstün, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah Smith, Beyza Ermis, Marzieh Fadaee, Sara Hooker
cs.AI
Abstract
Measuring progress is fundamental to the advancement of any scientific field.
As benchmarks play an increasingly central role, they also grow more
susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard
for ranking the most capable AI systems. Yet, in this work we identify
systematic issues that have resulted in a distorted playing field. We find that
undisclosed private testing practices benefit a handful of providers who are
able to test multiple variants before public release and retract scores if
desired. We establish that the ability of these providers to choose the best
score leads to biased Arena scores due to selective disclosure of performance
results. At an extreme, we identify 27 private LLM variants tested by Meta in
the lead-up to the Llama-4 release. We also establish that proprietary closed
models are sampled at higher rates (number of battles) and are removed from
the arena less often than open-weight and open-source alternatives. Both
these policies lead to large data access asymmetries over time. Providers like
Google and OpenAI have received an estimated 19.2% and 20.4% of all data on the
arena, respectively. In contrast, a combined 83 open-weight models have only
received an estimated 29.7% of the total data. We show that access to Chatbot
Arena data yields substantial benefits; even limited additional data can result
in relative performance gains of up to 112% on the arena distribution, based on
our conservative estimates. Together, these dynamics result in overfitting to
Arena-specific dynamics rather than general model quality. The Arena builds on
the substantial efforts of both the organizers and an open community that
maintains this valuable evaluation platform. We offer actionable
recommendations to reform the Chatbot Arena's evaluation framework and promote
fairer, more transparent benchmarking for the field.
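The claim that privately testing many variants and publishing only the best one inflates Arena scores follows from basic order statistics: the maximum of several noisy score estimates is biased upward even when every variant has the same true skill. Below is a minimal Monte Carlo sketch of that mechanism; it is not taken from the paper, and the specific figures (a 1200-point base score, 15 points of estimation noise) are illustrative assumptions, with only the count of 27 variants drawn from the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_SKILL = 1200.0   # assumed true Arena score shared by every variant
SCORE_NOISE = 15.0    # assumed std. dev. of a score estimate from limited battles
N_VARIANTS = 27       # number of private variants tested (as reported for Meta)
N_TRIALS = 100_000    # Monte Carlo repetitions

# Score reported when a provider submits a single model and keeps its result.
single = rng.normal(TRUE_SKILL, SCORE_NOISE, size=N_TRIALS)

# Score reported when a provider privately tests N_VARIANTS equally capable
# variants and discloses only the best-scoring one (selective disclosure).
best_of_n = rng.normal(
    TRUE_SKILL, SCORE_NOISE, size=(N_TRIALS, N_VARIANTS)
).max(axis=1)

print(f"mean reported score, single submission: {single.mean():.1f}")
print(f"mean reported score, best of {N_VARIANTS}:        {best_of_n.mean():.1f}")
print(f"expected inflation from selection:      {best_of_n.mean() - single.mean():.1f} points")
```

Under these assumptions, reporting the maximum of 27 noisy estimates shifts the expected published score upward by roughly two standard deviations of the estimation noise (about 32 points here), even though no variant is actually better than another.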