The Leaderboard Illusion

April 29, 2025
Authors: Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D'Souza, Sayash Kapoor, Ahmet Üstün, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah Smith, Beyza Ermis, Marzieh Fadaee, Sara Hooker
cs.AI

Abstract

Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted playing field. We find that undisclosed private testing practices benefit a handful of providers, who are able to test multiple variants before public release and retract scores if desired. We establish that the ability of these providers to choose the best score leads to biased Arena scores due to selective disclosure of performance results. At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release. We also establish that proprietary closed models are sampled at higher rates (number of battles) and are removed from the arena less often than open-weight and open-source alternatives. Both of these policies lead to large data-access asymmetries over time. Providers like Google and OpenAI have received an estimated 19.2% and 20.4% of all data on the arena, respectively. In contrast, a combined 83 open-weight models have received only an estimated 29.7% of the total data. We show that access to Chatbot Arena data yields substantial benefits; even limited additional data can result in relative performance gains of up to 112% on the arena distribution, based on our conservative estimates. Together, these effects result in overfitting to Arena-specific dynamics rather than general model quality. The Arena builds on the substantial efforts of both the organizers and an open community that maintains this valuable evaluation platform. We offer actionable recommendations to reform the Chatbot Arena's evaluation framework and promote fairer, more transparent benchmarking for the field.
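The selection effect described in the abstract can be illustrated with a short simulation. The sketch below is not the paper's methodology (the Arena uses Bradley-Terry-style rating from pairwise battles); it simply assumes each privately tested variant's measured score equals a hypothetical true score plus independent noise, and that the provider discloses only the best-scoring variant. The values for TRUE_SKILL, NOISE_SD, and the variant counts are illustrative assumptions, not figures from the paper.

```python
# Illustrative sketch of "best-of-N" selective disclosure, under the assumptions above.
import numpy as np

rng = np.random.default_rng(0)

TRUE_SKILL = 1200.0   # hypothetical underlying Arena score of the model family (assumption)
NOISE_SD = 25.0       # assumed per-variant measurement noise, in score points
TRIALS = 20_000       # Monte Carlo repetitions

for n_variants in [1, 3, 9, 27]:
    # Measured scores of n identically skilled private variants in each trial.
    measured = TRUE_SKILL + rng.normal(0.0, NOISE_SD, size=(TRIALS, n_variants))
    # The provider publishes only its best-scoring variant.
    published = measured.max(axis=1)
    # Average inflation of the reported score caused by selective disclosure.
    bias = published.mean() - TRUE_SKILL
    print(f"variants tested privately: {n_variants:2d} -> expected score inflation ~ {bias:5.1f} points")
```

Even though every variant has the same true skill, the expected published score rises with the number of privately tested variants, which is the bias mechanism the abstract attributes to undisclosed private testing and score retraction.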

