
SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks

July 1, 2025
Authors: Yilun Zhao, Kaiyan Zhang, Tiansheng Hu, Sihong Wu, Ronan Le Bras, Taira Anderson, Jonathan Bragg, Joseph Chee Chang, Jesse Dodge, Matt Latzke, Yixin Liu, Charles McGrady, Xiangru Tang, Zihang Wang, Chen Zhao, Hannaneh Hajishirzi, Doug Downey, Arman Cohan
cs.AI

Abstract

We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature tasks. Unlike traditional benchmarks for scientific literature understanding and synthesis, SciArena engages the research community directly, following the Chatbot Arena evaluation approach of community voting on model comparisons. By leveraging collective intelligence, SciArena offers a community-driven evaluation of model performance on open-ended scientific tasks that demand literature-grounded, long-form responses. The platform currently supports 23 open-source and proprietary foundation models and has collected over 13,000 votes from trusted researchers across diverse scientific domains. We analyze the data collected so far and confirm that the submitted questions are diverse, aligned with real-world literature needs, and that participating researchers demonstrate strong self-consistency and inter-annotator agreement in their evaluations. We discuss the results and insights based on the model ranking leaderboard. To further promote research in building model-based automated evaluation systems for literature tasks, we release SciArena-Eval, a meta-evaluation benchmark based on our collected preference data. The benchmark measures the accuracy of models in judging answer quality by comparing their pairwise assessments with human votes. Our experiments highlight the benchmark's challenges and emphasize the need for more reliable automated evaluation methods.
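As a rough illustration of the SciArena-Eval protocol described in the abstract, the sketch below scores a judge model's agreement with human votes over pairwise preference records. The `PreferenceRecord` schema, field names, tie handling, and the `judge_accuracy` helper are hypothetical and chosen for illustration; they are not the paper's actual data format or scoring code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PreferenceRecord:
    """One pairwise comparison from the platform (hypothetical schema)."""
    question: str
    answer_a: str    # response produced by model A
    answer_b: str    # response produced by model B
    human_vote: str  # community vote: "A", "B", or "tie"
    judge_vote: str  # the evaluator model's pairwise pick: "A", "B", or "tie"


def judge_accuracy(records: List[PreferenceRecord], skip_ties: bool = True) -> float:
    """Fraction of comparisons where the judge model matches the human vote.

    This mirrors the meta-evaluation idea behind SciArena-Eval: compare a
    model's pairwise assessments against collected human preferences.
    Dropping tie votes is an assumption; the benchmark may score them
    differently.
    """
    scored = [r for r in records if not (skip_ties and r.human_vote == "tie")]
    if not scored:
        return 0.0
    agree = sum(1 for r in scored if r.judge_vote == r.human_vote)
    return agree / len(scored)


if __name__ == "__main__":
    demo = [
        PreferenceRecord("Q1", "ans a", "ans b", human_vote="A", judge_vote="A"),
        PreferenceRecord("Q2", "ans a", "ans b", human_vote="B", judge_vote="A"),
        PreferenceRecord("Q3", "ans a", "ans b", human_vote="tie", judge_vote="B"),
    ]
    # Agrees on 1 of the 2 non-tie comparisons -> 0.50
    print(f"judge agreement: {judge_accuracy(demo):.2f}")
```

The same pairwise-vote records also feed the leaderboard; Chatbot Arena-style platforms typically aggregate them with an Elo or Bradley-Terry model, though the exact aggregation used by SciArena is not specified in this abstract.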