ChemLLM: 化学大規模言語モデル

要旨

大規模言語モデル（LLM）は、分子特性予測、分子生成、実験プロトコル設計など、化学分野での応用において目覚ましい進歩を遂げてきました。しかし、化学に特化した対話型モデルはまだ存在していません。この課題は、化学データや科学知識のほとんどが構造化データベースに保存されており、これらの構造化データを直接使用すると、モデルが一貫した対話を維持する能力が損なわれるという事実に起因しています。この問題を解決するため、我々は構造化された知識を平易な対話に変換する新しいテンプレートベースの指示構築法を開発し、言語モデルのトレーニングに適した形式にしました。このアプローチを活用して、化学に特化した初の大規模言語モデルであるChemLLMを開発し、化学分野のさまざまなタスクをスムーズな対話インタラクションで実行可能にしました。ChemLLMは、化学の主要な3つのタスク（名称変換、分子キャプション、反応予測）においてGPT-3.5を上回り、そのうち2つのタスクではGPT-4をも凌駕しました。特に注目すべきは、ChemLLMが主に化学中心のコーパスでトレーニングされているにもかかわらず、関連する数学や物理のタスクにも優れた適応性を示した点です。さらに、ChemLLMは、文献翻訳やケモインフォマティクスプログラミングなど、化学分野における専門的な自然言語処理タスクにも熟達しています。ChemLLMは化学研究における新たな探求の道を開き、構造化された化学知識を対話システムに統合する我々の手法は、さまざまな科学分野におけるLLM開発の新たなフロンティアを築きました。コード、データセット、およびモデルウェイトはhf.co/AI4Chem/ChemLLM-7B-Chatで公開されています。

English

Large language models (LLMs) have made impressive progress in chemistry applications, including molecular property prediction, molecular generation, experimental protocol design, etc. However, the community lacks a dialogue-based model specifically designed for chemistry. The challenge arises from the fact that most chemical data and scientific knowledge are primarily stored in structured databases, and the direct use of these structured data compromises the model's ability to maintain coherent dialogue. To tackle this issue, we develop a novel template-based instruction construction method that transforms structured knowledge into plain dialogue, making it suitable for language model training. By leveraging this approach, we develop ChemLLM, the first large language model dedicated to chemistry, capable of performing various tasks across chemical disciplines with smooth dialogue interaction. ChemLLM beats GPT-3.5 on all three principal tasks in chemistry, i.e., name conversion, molecular caption, and reaction prediction, and surpasses GPT-4 on two of them. Remarkably, ChemLLM also shows exceptional adaptability to related mathematical and physical tasks despite being trained mainly on chemical-centric corpora. Furthermore, ChemLLM demonstrates proficiency in specialized NLP tasks within chemistry, such as literature translation and cheminformatic programming. ChemLLM opens up a new avenue for exploration within chemical studies, while our method of integrating structured chemical knowledge into dialogue systems sets a new frontier for developing LLMs across various scientific fields. Codes, Datasets, and Model weights are publicly accessible at hf.co/AI4Chem/ChemLLM-7B-Chat.

ChemLLM: 化学大規模言語モデル

ChemLLM: A Chemical Large Language Model

要旨

Support