A Large Encoder-Decoder Family of Foundation Models For Chemical Language

July 24, 2024
作者: Eduardo Soares, Victor Shirasuna, Emilio Vital Brazil, Renato Cerqueira, Dmitry Zubarev, Kristin Schmidt
cs.AI

Abstract

Large-scale pre-training methodologies for chemical language models represent a breakthrough in cheminformatics. These methods excel in tasks such as property prediction and molecule generation by learning contextualized representations of input tokens through self-supervised learning on large unlabeled corpora. Typically, this involves pre-training on unlabeled data followed by fine-tuning on specific tasks, reducing dependence on annotated datasets and broadening chemical language representation understanding. This paper introduces a family of large encoder-decoder chemical foundation models pre-trained on a curated dataset of 91 million SMILES samples sourced from PubChem, equivalent to 4 billion molecular tokens. The proposed foundation model supports different complex tasks, including quantum property prediction, and offers flexibility with two main variants (289M and 8×289M). Our experiments across multiple benchmark datasets validate the capacity of the proposed model to provide state-of-the-art results on different tasks. We also provide a preliminary assessment of the compositionality of the embedding space as a prerequisite for reasoning tasks. We demonstrate that the produced latent space is separable compared to the state of the art, with few-shot learning capabilities.
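
The abstract describes the typical foundation-model workflow: self-supervised pre-training on unlabeled SMILES, then reusing the learned representations for downstream tasks such as few-shot property prediction on the frozen embedding space. The sketch below only illustrates that general pattern, assuming a Hugging Face Transformers-compatible encoder-decoder checkpoint; the checkpoint identifier, mean-pooling choice, example molecules, and labels are placeholders for illustration, not the authors' released model or pipeline.

```python
# Illustrative sketch (not the authors' code): embed SMILES with the encoder of
# an encoder-decoder chemical model, then fit a few-shot linear probe.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

CHECKPOINT = "your-org/chemical-encoder-decoder"  # placeholder identifier

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModel.from_pretrained(CHECKPOINT)
model.eval()

def embed(smiles_list):
    """Mean-pool the encoder's last hidden states into one vector per molecule."""
    batch = tokenizer(smiles_list, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model.encoder(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
        ).last_hidden_state                              # (batch, tokens, dim)
    mask = batch["attention_mask"].unsqueeze(-1)         # (batch, tokens, 1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Few-shot probe: a handful of labeled molecules, classifier on frozen embeddings.
# SMILES strings and labels below are arbitrary examples.
train_smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]
train_labels = [0, 1, 0, 1]
clf = LogisticRegression().fit(embed(train_smiles), train_labels)
print(clf.predict(embed(["CCOC", "c1ccncc1"])))
```

A linear probe on frozen embeddings is one simple way to test whether the latent space is separable with only a few labeled examples, which is the kind of few-shot capability the abstract refers to.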
