

NatureLM: Deciphering the Language of Nature for Scientific Discovery

February 11, 2025
作者: Yingce Xia, Peiran Jin, Shufang Xie, Liang He, Chuan Cao, Renqian Luo, Guoqing Liu, Yue Wang, Zequn Liu, Yuan-Jyue Chen, Zekun Guo, Yeqi Bai, Pan Deng, Yaosen Min, Ziheng Lu, Hongxia Hao, Han Yang, Jielan Li, Chang Liu, Jia Zhang, Jianwei Zhu, Kehan Wu, Wei Zhang, Kaiyuan Gao, Qizhi Pei, Qian Wang, Xixian Liu, Yanting Li, Houtian Zhu, Yeqing Lu, Mingqian Ma, Zun Wang, Tian Xie, Krzysztof Maziarz, Marwin Segler, Zhao Yang, Zilong Chen, Yu Shi, Shuxin Zheng, Lijun Wu, Chen Hu, Peggy Dai, Tie-Yan Liu, Haiguang Liu, Tao Qin
cs.AI

Abstract

Foundation models have revolutionized natural language processing and artificial intelligence, significantly enhancing how machines comprehend and generate human languages. Inspired by the success of these foundation models, researchers have developed foundation models for individual scientific domains, including small molecules, materials, proteins, DNA, and RNA. However, these models are typically trained in isolation, lacking the ability to integrate across different scientific domains. Recognizing that entities within these domains can all be represented as sequences, which together form the "language of nature", we introduce Nature Language Model (briefly, NatureLM), a sequence-based science foundation model designed for scientific discovery. Pre-trained with data from multiple scientific domains, NatureLM offers a unified, versatile model that enables various applications including: (i) generating and optimizing small molecules, proteins, RNA, and materials using text instructions; (ii) cross-domain generation/design, such as protein-to-molecule and protein-to-RNA generation; and (iii) achieving state-of-the-art performance in tasks like SMILES-to-IUPAC translation and retrosynthesis on USPTO-50k. NatureLM offers a promising generalist approach for various scientific tasks, including drug discovery (hit generation/optimization, ADMET optimization, synthesis), novel material design, and the development of therapeutic proteins or nucleotides. We have developed NatureLM models in different sizes (1 billion, 8 billion, and 46.7 billion parameters) and observed a clear improvement in performance as the model size increases.
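The abstract's central idea is that small molecules, proteins, RNA, DNA, and materials can all be serialized as sequences sharing one "language of nature", so a single autoregressive model can be trained across domains. The sketch below illustrates that unification step in Python; the domain tags (`<mol>`, `<protein>`, etc.) are illustrative assumptions, not NatureLM's actual vocabulary:

```python
# Minimal sketch of serializing heterogeneous scientific entities into
# one tagged text stream, assuming hypothetical domain tags (not the
# actual NatureLM tokenizer vocabulary).

def to_unified_sequence(domain: str, entity: str) -> str:
    """Wrap a domain entity (SMILES string, amino-acid sequence,
    nucleotide sequence, material formula) in domain tags so that
    data from different fields share a single text format."""
    tags = {"molecule": "mol", "protein": "protein",
            "rna": "rna", "dna": "dna", "material": "material"}
    tag = tags[domain]
    return f"<{tag}>{entity}</{tag}>"

# Entities from three domains, each already a plain sequence.
examples = [
    ("molecule", "CC(=O)Oc1ccccc1C(=O)O"),  # aspirin as SMILES
    ("protein", "MKTAYIAKQR"),              # amino-acid sequence
    ("rna", "AUGGCCUUA"),                   # nucleotide sequence
]

# A unified "corpus" a sequence model could consume during pre-training.
corpus = " ".join(to_unified_sequence(d, e) for d, e in examples)
print(corpus)
```

On this view, cross-domain tasks such as protein-to-molecule generation reduce to conditional sequence generation: the model is prompted with one tagged sequence (plus a text instruction) and decodes another.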

