
A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers

August 28, 2025
作者: Ming Hu, Chenglong Ma, Wei Li, Wanghan Xu, Jiamin Wu, Jucheng Hu, Tianbin Li, Guohang Zhuang, Jiaqi Liu, Yingzhou Lu, Ying Chen, Chaoyang Zhang, Cheng Tan, Jie Ying, Guocheng Wu, Shujian Gao, Pengcheng Chen, Jiashi Lin, Haitao Wu, Lulu Chen, Fengxiang Wang, Yuanyuan Zhang, Xiangyu Zhao, Feilong Tang, Encheng Su, Junzhi Ning, Xinyao Liu, Ye Du, Changkai Ji, Cheng Tang, Huihui Xu, Ziyang Chen, Ziyan Huang, Jiyao Liu, Pengfei Jiang, Yizhou Wang, Chen Tang, Jianyu Wu, Yuchen Ren, Siyuan Yan, Zhonghua Wang, Zhongxing Xu, Shiyan Su, Shangquan Sun, Runkai Zhao, Zhisheng Zhang, Yu Liu, Fudi Wang, Yuanfeng Ji, Yanzhou Su, Hongming Shan, Chunmei Feng, Jiahao Xu, Jiangtao Yan, Wenhao Tang, Diping Song, Lihao Liu, Yanyan Huang, Lequan Yu, Bin Fu, Shujun Wang, Xiaomeng Li, Xiaowei Hu, Yun Gu, Ben Fei, Zhongying Deng, Benyou Wang, Yuewen Cao, Minjie Shen, Haodong Duan, Jie Xu, Yirong Chen, Fang Yan, Hongxia Hao, Jielan Li, Jiajun Du, Yanbo Wang, Imran Razzak, Chi Zhang, Lijun Wu, Conghui He, Zhaohui Lu, Jinhai Huang, Yihao Liu, Fenghua Ling, Yuqiang Li, Aoran Wang, Qihao Zheng, Nanqing Dong, Tianfan Fu, Dongzhan Zhou, Yan Lu, Wenlong Zhang, Jin Ye, Jianfei Cai, Wanli Ouyang, Yu Qiao, Zongyuan Ge, Shixiang Tang, Junjun He, Chunfeng Song, Lei Bai, Bowen Zhou
cs.AI

Abstract

Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research, yet their progress is shaped by the complex nature of scientific data. This survey presents a comprehensive, data-centric synthesis that reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge, emphasizing the multimodal, cross-scale, and domain-specific challenges that differentiate scientific corpora from general natural language processing datasets. We systematically review recent Sci-LLMs, from general-purpose foundations to specialized models across diverse scientific disciplines, alongside an extensive analysis of over 270 pre-/post-training datasets, showing why Sci-LLMs place distinct demands on data: heterogeneous, multi-scale, uncertainty-laden corpora that require representations preserving domain invariance and enabling cross-modal reasoning. On evaluation, we examine over 190 benchmark datasets and trace a shift from static exams toward process- and discovery-oriented assessments with advanced evaluation protocols. These data-centric analyses highlight persistent issues in scientific data development and discuss emerging solutions involving semi-automated annotation pipelines and expert validation. Finally, we outline a paradigm shift toward closed-loop systems where autonomous agents based on Sci-LLMs actively experiment, validate, and contribute to a living, evolving knowledge base. Collectively, this work provides a roadmap for building trustworthy, continually evolving artificial intelligence (AI) systems that function as true partners in accelerating scientific discovery.
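To make the semi-automated annotation idea concrete, here is a minimal illustrative sketch (not taken from the survey; all names and the confidence-threshold triage rule are assumptions for illustration): a model proposes labels with self-reported confidence, high-confidence proposals are auto-accepted, and only low-confidence items are routed to a human expert for validation.

```python
# Illustrative sketch of a semi-automated annotation pipeline with
# expert validation. The Proposal schema, the `triage` threshold rule,
# and the `expert` callback are hypothetical, not from the survey.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Proposal:
    item_id: str
    label: str
    confidence: float  # model's self-reported confidence in [0, 1]


def triage(
    proposals: List[Proposal], threshold: float = 0.9
) -> Tuple[List[Proposal], List[Proposal]]:
    """Split proposals into an auto-accept queue and an expert-review queue."""
    accepted = [p for p in proposals if p.confidence >= threshold]
    review = [p for p in proposals if p.confidence < threshold]
    return accepted, review


def annotate(
    proposals: List[Proposal],
    expert: Callable[[Proposal], str],
    threshold: float = 0.9,
) -> Dict[str, str]:
    """Final labels: model labels above threshold, expert labels otherwise."""
    accepted, review = triage(proposals, threshold)
    labels = {p.item_id: p.label for p in accepted}
    labels.update({p.item_id: expert(p) for p in review})
    return labels
```

In practice the expert callback could log corrections back into the training corpus, which is the "living, evolving knowledge base" loop the abstract describes.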
September 1, 2025