

MolXPT: Wrapping Molecules with Text for Generative Pre-training

May 18, 2023
作者: Zequn Liu, Wei Zhang, Yingce Xia, Lijun Wu, Shufang Xie, Tao Qin, Ming Zhang, Tie-Yan Liu
cs.AI

Abstract

The generative pre-trained Transformer (GPT) has demonstrated great success in natural language processing, and related techniques have been adapted to molecular modeling. Considering that text is the most important record of scientific discovery, in this paper we propose MolXPT, a unified language model of text and molecules pre-trained on SMILES (a sequence representation of molecules) wrapped by text. Briefly, we detect the molecule names in each sequence and replace them with the corresponding SMILES. In this way, the SMILES can leverage information from the surrounding text, and vice versa. The wrapped sequences, together with text sequences from PubMed and SMILES sequences from PubChem, are all fed into a language model for pre-training. Experimental results demonstrate that MolXPT outperforms strong baselines on molecular property prediction on MoleculeNet, performs comparably to the best model on text-molecule translation while using less than half of its parameters, and enables zero-shot molecular generation without fine-tuning.
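
The abstract describes a pre-processing step in which molecule names found in text are replaced by their SMILES strings. Below is a minimal illustrative sketch of such a wrapping step, not taken from the paper: the `NAME_TO_SMILES` lookup table and the `<som>`/`<eom>` boundary tokens are assumptions for illustration, and a real pipeline would use an entity linker and a database such as PubChem.

```python
import re

# Hypothetical name-to-SMILES lookup; in practice this would come from a
# chemical database (e.g., PubChem) via entity linking, not a hand-built dict.
NAME_TO_SMILES = {
    "aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "caffeine": "CN1C=NC2=C1C(=O)N(C)C(=O)N2C",
}

def wrap_molecules(text: str) -> str:
    """Replace known molecule names with their SMILES, marked by boundary tokens."""
    def repl(match: re.Match) -> str:
        smiles = NAME_TO_SMILES[match.group(0).lower()]
        return f"<som> {smiles} <eom>"  # assumed boundary tokens, for illustration only
    pattern = re.compile("|".join(re.escape(n) for n in NAME_TO_SMILES), re.IGNORECASE)
    return pattern.sub(repl, text)

print(wrap_molecules("Aspirin irreversibly inhibits COX-1."))
# -> "<som> CC(=O)OC1=CC=CC=C1C(=O)O <eom> irreversibly inhibits COX-1."
```

Sequences produced this way, alongside plain PubMed text and plain PubChem SMILES, would then be tokenized and fed to the language model for pre-training, as the abstract states.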