ChatPaper.aiChatPaper

ChemLLM:一個化學大型語言模型

ChemLLM: A Chemical Large Language Model

February 10, 2024
作者: Di Zhang, Wei Liu, Qian Tan, Jingdan Chen, Hang Yan, Yuliang Yan, Jiatong Li, Weiran Huang, Xiangyu Yue, Dongzhan Zhou, Shufei Zhang, Mao Su, Hansen Zhong, Yuqiang Li, Wanli Ouyang
cs.AI

摘要

大型語言模型(LLMs)在化學應用中取得了令人矚目的進展,包括分子性質預測、分子生成、實驗方案設計等。然而,社群缺乏專門為化學設計的基於對話的模型。挑戰在於大多數化學數據和科學知識主要存儲在結構化數據庫中,直接使用這些結構化數據會影響模型保持連貫對話的能力。為應對此問題,我們開發了一種新穎的基於模板的指令構建方法,將結構化知識轉換為純對話,使其適合語言模型訓練。通過利用這種方法,我們開發了ChemLLM,這是首個專為化學而設的大型語言模型,能夠在化學領域執行各種任務並實現流暢的對話互動。ChemLLM在化學的三個主要任務,即命名轉換、分子標題和反應預測上擊敗了GPT-3.5,並在其中兩個任務上超越了GPT-4。值得注意的是,儘管主要在以化學為中心的語料庫上進行訓練,ChemLLM還展現了對相關數學和物理任務的出色適應能力。此外,ChemLLM在化學領域的專業NLP任務中表現出色,如文獻翻譯和化學信息學編程。ChemLLM為化學研究開辟了新的探索途徑,而我們將結構化化學知識整合到對話系統的方法為在各種科學領域開發LLMs設定了新的前沿。代碼、數據集和模型權重可在hf.co/AI4Chem/ChemLLM-7B-Chat公開訪問。
English
Large language models (LLMs) have made impressive progress in chemistry applications, including molecular property prediction, molecular generation, experimental protocol design, etc. However, the community lacks a dialogue-based model specifically designed for chemistry. The challenge arises from the fact that most chemical data and scientific knowledge are primarily stored in structured databases, and the direct use of these structured data compromises the model's ability to maintain coherent dialogue. To tackle this issue, we develop a novel template-based instruction construction method that transforms structured knowledge into plain dialogue, making it suitable for language model training. By leveraging this approach, we develop ChemLLM, the first large language model dedicated to chemistry, capable of performing various tasks across chemical disciplines with smooth dialogue interaction. ChemLLM beats GPT-3.5 on all three principal tasks in chemistry, i.e., name conversion, molecular caption, and reaction prediction, and surpasses GPT-4 on two of them. Remarkably, ChemLLM also shows exceptional adaptability to related mathematical and physical tasks despite being trained mainly on chemical-centric corpora. Furthermore, ChemLLM demonstrates proficiency in specialized NLP tasks within chemistry, such as literature translation and cheminformatic programming. ChemLLM opens up a new avenue for exploration within chemical studies, while our method of integrating structured chemical knowledge into dialogue systems sets a new frontier for developing LLMs across various scientific fields. Codes, Datasets, and Model weights are publicly accessible at hf.co/AI4Chem/ChemLLM-7B-Chat.
PDF317December 15, 2024