

MolXPT: Wrapping Molecules with Text for Generative Pre-training

May 18, 2023
Authors: Zequn Liu, Wei Zhang, Yingce Xia, Lijun Wu, Shufang Xie, Tao Qin, Ming Zhang, Tie-Yan Liu
cs.AI

Abstract

Generative pre-trained Transformer (GPT) has demonstrated great success in natural language processing, and related techniques have been adapted to molecular modeling. Considering that text is the most important record of scientific discovery, in this paper we propose MolXPT, a unified language model of text and molecules pre-trained on SMILES (a sequence representation of molecules) wrapped by text. Briefly, we detect the molecule names in each sequence and replace them with the corresponding SMILES. In this way, the SMILES can leverage information from the surrounding text, and vice versa. The wrapped sequences, text sequences from PubMed, and SMILES sequences from PubChem are all fed into a language model for pre-training. Experimental results demonstrate that MolXPT outperforms strong baselines on molecular property prediction on MoleculeNet, performs comparably to the best model on text-molecule translation while using fewer than half of its parameters, and enables zero-shot molecular generation without finetuning.
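The wrapping step the abstract describes (detect molecule names in text, then substitute the corresponding SMILES) can be illustrated with a minimal sketch. This is not the paper's pipeline: the hard-coded NAME_TO_SMILES lookup, the regex-based name detection, and the <som>/<eom> marker tokens below are all illustrative assumptions; a real system would use a chemical named-entity recognizer and PubChem lookups at scale.

```python
import re

# Hypothetical name -> SMILES lookup; a stand-in for entity linking against PubChem.
NAME_TO_SMILES = {
    "aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "caffeine": "CN1C=NC2=C1C(=O)N(C)C(=O)N2C",
}

# Match any known molecule name as a whole word, case-insensitively.
NAME_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, NAME_TO_SMILES)) + r")\b",
    flags=re.IGNORECASE,
)

def wrap_molecules(text: str) -> str:
    """Replace detected molecule names with their SMILES, bracketed by assumed
    start/end-of-molecule tokens so a language model can switch modality."""
    def substitute(match: re.Match) -> str:
        smiles = NAME_TO_SMILES[match.group(0).lower()]
        return f"<som> {smiles} <eom>"
    return NAME_PATTERN.sub(substitute, text)

if __name__ == "__main__":
    sentence = "Aspirin reduces inflammation, whereas caffeine blocks adenosine receptors."
    print(wrap_molecules(sentence))
    # <som> CC(=O)OC1=CC=CC=C1C(=O)O <eom> reduces inflammation, whereas
    # <som> CN1C=NC2=C1C(=O)N(C)C(=O)N2C <eom> blocks adenosine receptors.
```

Sequences wrapped this way, together with plain PubMed text and plain PubChem SMILES, are what the abstract says are fed into a single language model for pre-training.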