SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence
December 26, 2025
Authors: Yiheng Wang, Yixin Chen, Shuo Li, Yifan Zhou, Bo Liu, Hengjian Gao, Jiakang Yuan, Jia Bu, Wanghan Xu, Yuhao Zhou, Xiangyu Zhao, Zhiwang Zhou, Fengxiang Wang, Haodong Duan, Songyang Zhang, Jun Yao, Han Deng, Yizhou Wang, Jiabei Xiao, Jiaqi Liu, Encheng Su, Yujie Liu, Weida Wang, Junchi Yao, Shenghe Zheng, Haoran Sun, Runmin Ma, Xiangchao Yan, Bo Zhang, Dongzhan Zhou, Shufei Zhang, Peng Ye, Xiaosong Wang, Shixiang Tang, Wenlong Zhang, Lei Bai
cs.AI
Abstract
We introduce SciEvalKit, a unified benchmarking toolkit designed to evaluate AI models for science across a broad range of scientific disciplines and task capabilities. Unlike general-purpose evaluation platforms, SciEvalKit focuses on the core competencies of scientific intelligence, including Scientific Multimodal Perception, Scientific Multimodal Reasoning, Scientific Multimodal Understanding, Scientific Symbolic Reasoning, Scientific Code Generation, Scientific Hypothesis Generation, and Scientific Knowledge Understanding. It supports six major scientific domains, ranging from physics and chemistry to astronomy and materials science. SciEvalKit builds a foundation of expert-grade scientific benchmarks curated from real-world, domain-specific datasets, ensuring that tasks reflect authentic scientific challenges. The toolkit features a flexible, extensible evaluation pipeline that enables batch evaluation across models and datasets, supports custom model and dataset integration, and produces transparent, reproducible, and comparable results. By bridging capability-based evaluation and disciplinary diversity, SciEvalKit offers a standardized yet customizable infrastructure for benchmarking the next generation of scientific foundation models and intelligent agents. The toolkit is open-sourced and actively maintained to foster community-driven development and progress in AI4Science.
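To make the pipeline design concrete, the following is a minimal, self-contained sketch of the pattern the abstract describes: a model registry for custom model integration and a batch loop that evaluates every model on every benchmark. All names here (register_model, evaluate, the toy benchmark and model IDs) are illustrative assumptions, not SciEvalKit's actual API; consult the released repository for the real interface.

```python
"""Minimal sketch of a registry-plus-batch-evaluation pattern.
All names are illustrative assumptions, not SciEvalKit's actual API."""

from typing import Callable, Dict, List

# Registry mapping model names to inference callables (prompt -> answer).
MODEL_REGISTRY: Dict[str, Callable[[str], str]] = {}


def register_model(name: str):
    """Decorator that registers a custom model's inference function."""
    def wrap(fn: Callable[[str], str]) -> Callable[[str], str]:
        MODEL_REGISTRY[name] = fn
        return fn
    return wrap


@register_model("toy-model")  # hypothetical model ID
def toy_model(prompt: str) -> str:
    # Stand-in for a real backend (API call or local inference).
    return "42"


def evaluate(
    models: List[str],
    benchmarks: Dict[str, List[dict]],
) -> Dict[str, Dict[str, float]]:
    """Batch-evaluate each model on each benchmark (exact-match accuracy)."""
    scores: Dict[str, Dict[str, float]] = {}
    for model_name in models:
        infer = MODEL_REGISTRY[model_name]
        scores[model_name] = {}
        for bench_name, items in benchmarks.items():
            correct = sum(infer(q["question"]) == q["answer"] for q in items)
            scores[model_name][bench_name] = correct / len(items)
    return scores


if __name__ == "__main__":
    toy_bench = {"toy_physics": [{"question": "6 * 7 = ?", "answer": "42"}]}
    print(evaluate(["toy-model"], toy_bench))  # {'toy-model': {'toy_physics': 1.0}}
```

The registry-based design sketched above is one plausible way to realize the abstract's claims of custom model integration (any backend behind a common inference hook) and batch evaluation (a single loop over registered models and curated benchmarks), with per-model, per-benchmark scores supporting transparent, comparable results.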