SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence
December 26, 2025
Authors: Yiheng Wang, Yixin Chen, Shuo Li, Yifan Zhou, Bo Liu, Hengjian Gao, Jiakang Yuan, Jia Bu, Wanghan Xu, Yuhao Zhou, Xiangyu Zhao, Zhiwang Zhou, Fengxiang Wang, Haodong Duan, Songyang Zhang, Jun Yao, Han Deng, Yizhou Wang, Jiabei Xiao, Jiaqi Liu, Encheng Su, Yujie Liu, Weida Wang, Junchi Yao, Shenghe Zheng, Haoran Sun, Runmin Ma, Xiangchao Yan, Bo Zhang, Dongzhan Zhou, Shufei Zhang, Peng Ye, Xiaosong Wang, Shixiang Tang, Wenlong Zhang, Lei Bai
cs.AI
Abstract
We introduce SciEvalKit, a unified benchmarking toolkit designed to evaluate AI models for science across a broad range of scientific disciplines and task capabilities. Unlike general-purpose evaluation platforms, SciEvalKit focuses on the core competencies of scientific intelligence, including Scientific Multimodal Perception, Scientific Multimodal Reasoning, Scientific Multimodal Understanding, Scientific Symbolic Reasoning, Scientific Code Generation, Scientific Hypothesis Generation, and Scientific Knowledge Understanding. It supports six major scientific domains, ranging from physics and chemistry to astronomy and materials science. SciEvalKit builds on a foundation of expert-grade scientific benchmarks curated from real-world, domain-specific datasets, ensuring that tasks reflect authentic scientific challenges. The toolkit features a flexible, extensible evaluation pipeline that enables batch evaluation across models and datasets, supports custom model and dataset integration, and produces transparent, reproducible, and comparable results. By bridging capability-based evaluation and disciplinary diversity, SciEvalKit offers a standardized yet customizable infrastructure for benchmarking the next generation of scientific foundation models and intelligent agents. The toolkit is open-sourced and actively maintained to foster community-driven development and progress in AI4Science.
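To make the pipeline description concrete, below is a minimal sketch of how a capability-based batch evaluation of this kind is typically structured: models and benchmarks are independent plug-in units, and every model is scored on every dataset with a shared metric so results stay directly comparable. All names in the sketch (EvalItem, Benchmark, Model, EchoModel, exact_match, run_batch) are hypothetical illustrations, not the actual SciEvalKit API; consult the open-source repository for the real interface.

from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalItem:
    """One task instance: a prompt and its reference answer."""
    prompt: str
    answer: str

@dataclass
class Benchmark:
    """A domain-specific dataset tagged with the capability it probes."""
    name: str                            # hypothetical, e.g. "toy-chemistry"
    capability: str                      # e.g. "Scientific Symbolic Reasoning"
    items: list[EvalItem]
    metric: Callable[[str, str], float]  # (prediction, reference) -> score

class Model:
    """Adapter interface a custom model integration would implement."""
    name: str = "base"
    def generate(self, prompt: str) -> str:
        raise NotImplementedError

class EchoModel(Model):
    """Toy stand-in for a real backend (API client, local weights, etc.)."""
    name = "echo"
    def generate(self, prompt: str) -> str:
        return prompt.rsplit(" ", 1)[-1]  # pretend the last token is the answer

def exact_match(prediction: str, reference: str) -> float:
    """Simplest possible metric; real benchmarks use task-specific scoring."""
    return float(prediction.strip() == reference.strip())

def run_batch(models: list[Model], benchmarks: list[Benchmark]) -> dict:
    """Score every model on every benchmark; keys are (model, benchmark)."""
    results = {}
    for model in models:
        for bench in benchmarks:
            scores = [bench.metric(model.generate(item.prompt), item.answer)
                      for item in bench.items]
            results[(model.name, bench.name)] = sum(scores) / len(scores)
    return results

if __name__ == "__main__":
    bench = Benchmark(
        name="toy-chemistry",
        capability="Scientific Knowledge Understanding",
        items=[EvalItem("What is the chemical symbol for gold? Answer: Au", "Au")],
        metric=exact_match,
    )
    print(run_batch([EchoModel()], [bench]))  # {('echo', 'toy-chemistry'): 1.0}

The design point the sketch mirrors is the separation the abstract describes: because models and datasets only meet inside the batch loop, adding a custom model or a new benchmark requires implementing one small interface rather than modifying the pipeline itself.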