SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks
July 1, 2025
Authors: Yilun Zhao, Kaiyan Zhang, Tiansheng Hu, Sihong Wu, Ronan Le Bras, Taira Anderson, Jonathan Bragg, Joseph Chee Chang, Jesse Dodge, Matt Latzke, Yixin Liu, Charles McGrady, Xiangru Tang, Zihang Wang, Chen Zhao, Hannaneh Hajishirzi, Doug Downey, Arman Cohan
cs.AI
Abstract
We present SciArena, an open and collaborative platform for evaluating
foundation models on scientific literature tasks. Unlike traditional benchmarks
for scientific literature understanding and synthesis, SciArena engages the
research community directly, following the Chatbot Arena evaluation approach of
community voting on model comparisons. By leveraging collective intelligence,
SciArena offers a community-driven evaluation of model performance on
open-ended scientific tasks that demand literature-grounded, long-form
responses. The platform currently supports 23 open-source and proprietary
foundation models and has collected over 13,000 votes from trusted researchers
across diverse scientific domains. We analyze the data collected so far and
confirm that the submitted questions are diverse, aligned with real-world
literature needs, and that participating researchers demonstrate strong
self-consistency and inter-annotator agreement in their evaluations. We discuss
the results and insights based on the model ranking leaderboard. To further
promote research in building model-based automated evaluation systems for
literature tasks, we release SciArena-Eval, a meta-evaluation benchmark based
on our collected preference data. The benchmark measures the accuracy of models
in judging answer quality by comparing their pairwise assessments with human
votes. Our experiments highlight the benchmark's challenges and emphasize the
need for more reliable automated evaluation methods.
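The abstract does not specify how SciArena aggregates community votes into its model ranking leaderboard; Chatbot Arena-style platforms typically fit an Elo or Bradley-Terry rating from pairwise outcomes. The sketch below is a minimal illustration of that idea, assuming a simple online Elo update over hypothetical vote records (the `votes` data, `k` factor, and model names are illustrative, not taken from the paper).

```python
from collections import defaultdict

def elo_leaderboard(votes, k=4.0, initial=1000.0):
    """Aggregate pairwise votes into Elo-style ratings.

    votes: iterable of (model_a, model_b, winner) tuples, where
           winner is "A", "B", or "tie".
    Returns a dict mapping model name -> rating.
    """
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, winner in votes:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the Elo model.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = {"A": 1.0, "B": 0.0, "tie": 0.5}[winner]
        ratings[model_a] = ra + k * (score_a - expected_a)
        ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)

# Hypothetical vote records, for illustration only.
votes = [
    ("model-x", "model-y", "A"),
    ("model-y", "model-z", "tie"),
    ("model-x", "model-z", "B"),
]
for model, rating in sorted(elo_leaderboard(votes).items(),
                            key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.1f}")
```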
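SciArena-Eval is described as measuring how accurately a model judge's pairwise assessments match human votes. Below is a minimal sketch of that agreement metric, assuming each example records one human vote and one judge verdict over the same answer pair; the record format and field names are hypothetical, not the benchmark's actual schema.

```python
def meta_eval_accuracy(examples):
    """Fraction of pairwise examples where the model judge's verdict
    agrees with the human vote.

    examples: iterable of dicts with hypothetical fields
              "human_vote" and "judge_vote", each "A" or "B".
    """
    matches = sum(1 for ex in examples
                  if ex["judge_vote"] == ex["human_vote"])
    return matches / len(examples)

# Illustrative records only; not actual SciArena-Eval data.
examples = [
    {"human_vote": "A", "judge_vote": "A"},
    {"human_vote": "B", "judge_vote": "A"},
    {"human_vote": "B", "judge_vote": "B"},
]
print(f"Judge-human agreement: {meta_eval_accuracy(examples):.2f}")
```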