ChemLLM：一种化学大型语言模型

摘要

大型语言模型（LLMs）在化学应用方面取得了令人瞩目的进展，包括分子性质预测、分子生成、实验方案设计等。然而，社区缺乏专门针对化学设计的基于对话的模型。挑战在于大多数化学数据和科学知识主要存储在结构化数据库中，直接使用这些结构化数据会影响模型保持连贯对话的能力。为了解决这个问题，我们开发了一种新颖的基于模板的指导构建方法，将结构化知识转化为简洁对话，使其适用于语言模型训练。通过利用这种方法，我们开发了ChemLLM，这是第一个专门用于化学的大型语言模型，能够在化学领域执行各种任务，并实现流畅的对话交互。ChemLLM在化学的三项主要任务，即名称转换、分子说明和反应预测方面击败了GPT-3.5，并在其中两项任务上超越了GPT-4。值得注意的是，尽管主要在以化学为中心的语料库上进行训练，ChemLLM还展现出对相关数学和物理任务的出色适应能力。此外，ChemLLM在化学领域的专业NLP任务中表现出熟练，如文献翻译和化学信息编程。ChemLLM为化学研究开辟了新的探索途径，而我们将结构化化学知识整合到对话系统中的方法为在各种科学领域开发LLMs设定了新的前沿。代码、数据集和模型权重可在hf.co/AI4Chem/ChemLLM-7B-Chat上公开获取。

English

Large language models (LLMs) have made impressive progress in chemistry applications, including molecular property prediction, molecular generation, experimental protocol design, etc. However, the community lacks a dialogue-based model specifically designed for chemistry. The challenge arises from the fact that most chemical data and scientific knowledge are primarily stored in structured databases, and the direct use of these structured data compromises the model's ability to maintain coherent dialogue. To tackle this issue, we develop a novel template-based instruction construction method that transforms structured knowledge into plain dialogue, making it suitable for language model training. By leveraging this approach, we develop ChemLLM, the first large language model dedicated to chemistry, capable of performing various tasks across chemical disciplines with smooth dialogue interaction. ChemLLM beats GPT-3.5 on all three principal tasks in chemistry, i.e., name conversion, molecular caption, and reaction prediction, and surpasses GPT-4 on two of them. Remarkably, ChemLLM also shows exceptional adaptability to related mathematical and physical tasks despite being trained mainly on chemical-centric corpora. Furthermore, ChemLLM demonstrates proficiency in specialized NLP tasks within chemistry, such as literature translation and cheminformatic programming. ChemLLM opens up a new avenue for exploration within chemical studies, while our method of integrating structured chemical knowledge into dialogue systems sets a new frontier for developing LLMs across various scientific fields. Codes, Datasets, and Model weights are publicly accessible at hf.co/AI4Chem/ChemLLM-7B-Chat.

ChemLLM：一种化学大型语言模型

ChemLLM: A Chemical Large Language Model

摘要

Support