SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks

Abstract

We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature tasks. Unlike traditional benchmarks for scientific literature understanding and synthesis, SciArena engages the research community directly, following the Chatbot Arena approach of community voting on model comparisons. By leveraging collective intelligence, SciArena offers a community-driven evaluation of model performance on open-ended scientific tasks that demand literature-grounded, long-form responses. The platform currently supports 23 open-source and proprietary foundation models and has collected over 13,000 votes from trusted researchers across diverse scientific domains. We analyze the data collected so far and confirm that the submitted questions are diverse and aligned with real-world literature needs, and that participating researchers show strong self-consistency and inter-annotator agreement in their evaluations. We discuss the results and insights drawn from the model ranking leaderboard. To further promote research on model-based automated evaluation systems for literature tasks, we release SciArena-Eval, a meta-evaluation benchmark built from our collected preference data. The benchmark measures how accurately models judge answer quality by comparing their pairwise assessments with human votes. Our experiments highlight the benchmark's challenges and underscore the need for more reliable automated evaluation methods.
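
As a rough illustration of the meta-evaluation idea described above, the following is a minimal sketch of pairwise agreement accuracy: the fraction of comparisons in which a model judge prefers the same response as the human voter. The function name `pairwise_agreement`, the "A"/"B" labels, and the toy data are assumptions made for illustration only; they are not taken from SciArena-Eval, whose actual data format and scoring details are not given here.

```python
from typing import Sequence


def pairwise_agreement(judge_choices: Sequence[str], human_votes: Sequence[str]) -> float:
    """Return the fraction of comparisons where the judge matches the human vote.

    Each element is a hypothetical label ("A" or "B") naming the preferred
    response in one pairwise comparison.
    """
    if len(judge_choices) != len(human_votes):
        raise ValueError("judge_choices and human_votes must have the same length")
    if len(human_votes) == 0:
        raise ValueError("at least one comparison is required")
    # Count the comparisons where both sides picked the same response.
    matches = sum(j == h for j, h in zip(judge_choices, human_votes))
    return matches / len(human_votes)


# Toy data, invented for this sketch: four comparisons, one disagreement.
human = ["A", "B", "A", "A"]
judge = ["A", "B", "B", "A"]
print(pairwise_agreement(judge, human))  # 0.75
```

A score near 0.5 on balanced two-way comparisons would mean the judge does little better than chance, which is one way to read the abstract's call for more reliable automated evaluation methods.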
