ChemLLM: 화학 분야 대규모 언어 모델

초록

대형 언어 모델(LLMs)은 분자 특성 예측, 분자 생성, 실험 프로토콜 설계 등 화학 분야 응용에서 인상적인 진전을 이루어 왔습니다. 그러나 화학에 특화된 대화 기반 모델은 아직 부족한 상황입니다. 이 문제는 대부분의 화학 데이터와 과학적 지식이 주로 구조화된 데이터베이스에 저장되어 있으며, 이러한 구조화된 데이터를 직접 사용할 경우 모델의 일관된 대화 능력이 저하되기 때문에 발생합니다. 이 문제를 해결하기 위해, 우리는 구조화된 지식을 평문 대화로 변환하는 새로운 템플릿 기반 명령어 구성 방법을 개발하여 언어 모델 학습에 적합하도록 만들었습니다. 이 접근법을 활용하여, 우리는 화학 분야 전반의 다양한 작업을 원활한 대화 상호작용으로 수행할 수 있는 최초의 대형 언어 모델인 ChemLLM을 개발했습니다. ChemLLM은 화학의 세 가지 주요 작업, 즉 명칭 변환, 분자 설명, 반응 예측에서 GPT-3.5를 모두 능가하며, 이 중 두 작업에서는 GPT-4도 뛰어넘었습니다. 특히, ChemLLM은 주로 화학 중심 코퍼스로 훈련되었음에도 관련 수학 및 물리학 작업에서도 탁월한 적응력을 보였습니다. 더 나아가, ChemLLM은 문헌 번역 및 화학정보학 프로그래밍과 같은 화학 내 특화된 NLP 작업에서도 뛰어난 능력을 입증했습니다. ChemLLM은 화학 연구 내 새로운 탐구의 길을 열었으며, 구조화된 화학 지식을 대화 시스템에 통합하는 우리의 방법은 다양한 과학 분야에서 LLM 개발을 위한 새로운 지평을 제시합니다. 코드, 데이터셋, 모델 가중치는 hf.co/AI4Chem/ChemLLM-7B-Chat에서 공개적으로 접근 가능합니다.

English

Large language models (LLMs) have made impressive progress in chemistry applications, including molecular property prediction, molecular generation, experimental protocol design, etc. However, the community lacks a dialogue-based model specifically designed for chemistry. The challenge arises from the fact that most chemical data and scientific knowledge are primarily stored in structured databases, and the direct use of these structured data compromises the model's ability to maintain coherent dialogue. To tackle this issue, we develop a novel template-based instruction construction method that transforms structured knowledge into plain dialogue, making it suitable for language model training. By leveraging this approach, we develop ChemLLM, the first large language model dedicated to chemistry, capable of performing various tasks across chemical disciplines with smooth dialogue interaction. ChemLLM beats GPT-3.5 on all three principal tasks in chemistry, i.e., name conversion, molecular caption, and reaction prediction, and surpasses GPT-4 on two of them. Remarkably, ChemLLM also shows exceptional adaptability to related mathematical and physical tasks despite being trained mainly on chemical-centric corpora. Furthermore, ChemLLM demonstrates proficiency in specialized NLP tasks within chemistry, such as literature translation and cheminformatic programming. ChemLLM opens up a new avenue for exploration within chemical studies, while our method of integrating structured chemical knowledge into dialogue systems sets a new frontier for developing LLMs across various scientific fields. Codes, Datasets, and Model weights are publicly accessible at hf.co/AI4Chem/ChemLLM-7B-Chat.

ChemLLM: 화학 분야 대규모 언어 모델

ChemLLM: A Chemical Large Language Model

초록

Support