MolXPT: Wrapping Molecules with Text for Generative Pre-training

Abstract

Generative pre-trained Transformer (GPT) has demonstrated great success in natural language processing, and related techniques have been adapted to molecular modeling. Considering that text is the most important record of scientific discovery, in this paper we propose MolXPT, a unified language model of text and molecules pre-trained on SMILES (a sequence representation of molecules) wrapped by text. Briefly, we detect the molecule names in each sequence and replace them with the corresponding SMILES. In this way, the SMILES can leverage information from the surrounding text, and vice versa. The wrapped sequences, text sequences from PubMed, and SMILES sequences from PubChem are all fed into a language model for pre-training. Experimental results demonstrate that MolXPT outperforms strong baselines for molecular property prediction on MoleculeNet, performs comparably to the best model in text-molecule translation while using less than half of its parameters, and enables zero-shot molecular generation without fine-tuning.
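
The wrapping step described in the abstract (detect molecule names, replace them with SMILES) can be illustrated with a minimal Python sketch. This is not the authors' pipeline. It assumes a small hand-written name-to-SMILES dictionary in place of a real entity detector and PubChem lookup, and the function name `wrap_molecules` and the marker strings `<som>` and `<eom>` are hypothetical choices made for this example only.

```python
import re

# Hypothetical lookup table (keys must be lowercase). A real system would
# need a named-entity detector and a database such as PubChem instead.
NAME_TO_SMILES = {
    "aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "caffeine": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
}


def wrap_molecules(text, lookup=NAME_TO_SMILES, start="<som>", end="<eom>"):
    """Replace known molecule names in `text` with marker-delimited SMILES.

    The start/end markers are illustrative assumptions, not taken from
    the source text.
    """
    if not lookup:
        return text
    # Try longer names first so a short name never shadows a longer one.
    names = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )

    def substitute(match):
        smiles = lookup[match.group(1).lower()]
        return f"{start} {smiles} {end}"

    return pattern.sub(substitute, text)


if __name__ == "__main__":
    print(wrap_molecules("Aspirin inhibits COX-1."))
```

In the wrapped sentence the SMILES string sits in the same context window as the words around it, which is what would let a language model relate the two during pre-training.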
