
A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers

August 28, 2025
Authors: Ming Hu, Chenglong Ma, Wei Li, Wanghan Xu, Jiamin Wu, Jucheng Hu, Tianbin Li, Guohang Zhuang, Jiaqi Liu, Yingzhou Lu, Ying Chen, Chaoyang Zhang, Cheng Tan, Jie Ying, Guocheng Wu, Shujian Gao, Pengcheng Chen, Jiashi Lin, Haitao Wu, Lulu Chen, Fengxiang Wang, Yuanyuan Zhang, Xiangyu Zhao, Feilong Tang, Encheng Su, Junzhi Ning, Xinyao Liu, Ye Du, Changkai Ji, Cheng Tang, Huihui Xu, Ziyang Chen, Ziyan Huang, Jiyao Liu, Pengfei Jiang, Yizhou Wang, Chen Tang, Jianyu Wu, Yuchen Ren, Siyuan Yan, Zhonghua Wang, Zhongxing Xu, Shiyan Su, Shangquan Sun, Runkai Zhao, Zhisheng Zhang, Yu Liu, Fudi Wang, Yuanfeng Ji, Yanzhou Su, Hongming Shan, Chunmei Feng, Jiahao Xu, Jiangtao Yan, Wenhao Tang, Diping Song, Lihao Liu, Yanyan Huang, Lequan Yu, Bin Fu, Shujun Wang, Xiaomeng Li, Xiaowei Hu, Yun Gu, Ben Fei, Zhongying Deng, Benyou Wang, Yuewen Cao, Minjie Shen, Haodong Duan, Jie Xu, Yirong Chen, Fang Yan, Hongxia Hao, Jielan Li, Jiajun Du, Yanbo Wang, Imran Razzak, Chi Zhang, Lijun Wu, Conghui He, Zhaohui Lu, Jinhai Huang, Yihao Liu, Fenghua Ling, Yuqiang Li, Aoran Wang, Qihao Zheng, Nanqing Dong, Tianfan Fu, Dongzhan Zhou, Yan Lu, Wenlong Zhang, Jin Ye, Jianfei Cai, Wanli Ouyang, Yu Qiao, Zongyuan Ge, Shixiang Tang, Junjun He, Chunfeng Song, Lei Bai, Bowen Zhou
cs.AI

Abstract

Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research, yet their progress is shaped by the complex nature of scientific data. This survey presents a comprehensive, data-centric synthesis that reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge, emphasizing the multimodal, cross-scale, and domain-specific challenges that differentiate scientific corpora from general natural language processing datasets. We systematically review recent Sci-LLMs, from general-purpose foundations to specialized models across diverse scientific disciplines, alongside an extensive analysis of over 270 pre-/post-training datasets, showing why Sci-LLMs pose distinct demands: heterogeneous, multi-scale, uncertainty-laden corpora that require representations preserving domain invariance and enabling cross-modal reasoning. On evaluation, we examine over 190 benchmark datasets and trace a shift from static exams toward process- and discovery-oriented assessments with advanced evaluation protocols. These data-centric analyses highlight persistent issues in scientific data development and discuss emerging solutions involving semi-automated annotation pipelines and expert validation. Finally, we outline a paradigm shift toward closed-loop systems in which autonomous agents based on Sci-LLMs actively experiment, validate, and contribute to a living, evolving knowledge base. Collectively, this work provides a roadmap for building trustworthy, continually evolving artificial intelligence (AI) systems that function as true partners in accelerating scientific discovery.