The Leaderboard Illusion
April 29, 2025
Authors: Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D'Souza, Sayash Kapoor, Ahmet Üstün, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah Smith, Beyza Ermis, Marzieh Fadaee, Sara Hooker
cs.AI
Abstract
Measuring progress is fundamental to the advancement of any scientific field.
As benchmarks play an increasingly central role, they also grow more
susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard
for ranking the most capable AI systems. Yet, in this work we identify
systematic issues that have resulted in a distorted playing field. We find that
undisclosed private testing practices benefit a handful of providers who are
able to test multiple variants before public release and retract scores if
desired. We establish that the ability of these providers to choose the best
score leads to biased Arena scores due to selective disclosure of performance
results. At an extreme, we identify 27 private LLM variants tested by Meta in
the lead-up to the Llama-4 release. We also establish that proprietary closed
models are sampled at higher rates (number of battles) and have fewer models
removed from the arena than open-weight and open-source alternatives. Both
these policies lead to large data access asymmetries over time. Providers like
Google and OpenAI have received an estimated 19.2% and 20.4% of all data on the
arena, respectively. In contrast, a combined 83 open-weight models have only
received an estimated 29.7% of the total data. We show that access to Chatbot
Arena data yields substantial benefits; even limited additional data can result
in relative performance gains of up to 112% on the arena distribution, based on
our conservative estimates. Together, these dynamics result in overfitting to
Arena-specific dynamics rather than general model quality. The Arena builds on
the substantial efforts of both the organizers and an open community that
maintains this valuable evaluation platform. We offer actionable
recommendations to reform the Chatbot Arena's evaluation framework and promote
fairer, more transparent benchmarking for the field.
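The core selection effect the abstract describes, a provider privately testing many variants and publishing only the best-scoring one, can be illustrated with a small Monte Carlo simulation. The sketch below is a minimal illustration under simplified assumptions, not the paper's actual methodology: it posits a single reference opponent, equal true strength for every private variant, and a rating estimated by inverting the Elo/Bradley-Terry win-rate curve; the specific ratings, battle counts, and trial counts are illustrative.

```python
# Minimal sketch of the selective-disclosure effect: if N privately tested
# variants all have the SAME true strength and only the best estimate is
# published, the reported rating is biased upward relative to an honest
# single submission. All parameters below are illustrative assumptions.
import math
import random

random.seed(0)

TRUE_RATING = 1200.0       # true skill of every private variant (Elo-like scale)
OPPONENT_RATING = 1200.0   # single reference opponent, for simplicity
BATTLES_PER_VARIANT = 500  # votes collected per private variant
NUM_VARIANTS = 27          # e.g. the 27 Llama-4 variants identified in the paper
TRIALS = 2000              # Monte Carlo repetitions

def win_prob(r_a: float, r_b: float) -> float:
    """Bradley-Terry / Elo expected probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def estimate_rating(n_battles: int) -> float:
    """Estimate a variant's rating from its empirical win rate."""
    p_true = win_prob(TRUE_RATING, OPPONENT_RATING)
    wins = sum(random.random() < p_true for _ in range(n_battles))
    p_hat = min(max(wins / n_battles, 1e-3), 1 - 1e-3)  # keep away from 0/1
    # invert the Elo curve: rating difference implied by the observed win rate
    return OPPONENT_RATING + 400.0 * math.log10(p_hat / (1.0 - p_hat))

single_scores, best_of_n_scores = [], []
for _ in range(TRIALS):
    estimates = [estimate_rating(BATTLES_PER_VARIANT) for _ in range(NUM_VARIANTS)]
    single_scores.append(estimates[0])       # honest single submission
    best_of_n_scores.append(max(estimates))  # publish only the best variant

mean = lambda xs: sum(xs) / len(xs)
print(f"single submission     : {mean(single_scores):7.1f} (true {TRUE_RATING:.0f})")
print(f"best of {NUM_VARIANTS} variants    : {mean(best_of_n_scores):7.1f}  <- selection bias")
```

Under these assumptions, the single-submission estimate centers on the true rating, while taking the maximum over 27 equally strong but noisy estimates shifts the reported score upward by several dozen points; this is the kind of bias the abstract attributes to selective disclosure of performance results.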