A Large Encoder-Decoder Family of Foundation Models For Chemical Language
July 24, 2024
Authors: Eduardo Soares, Victor Shirasuna, Emilio Vital Brazil, Renato Cerqueira, Dmitry Zubarev, Kristin Schmidt
cs.AI
Abstract
Large-scale pre-training methodologies for chemical language models represent
a breakthrough in cheminformatics. These methods excel in tasks such as
property prediction and molecule generation by learning contextualized
representations of input tokens through self-supervised learning on large
unlabeled corpora. Typically, this involves pre-training on unlabeled data
followed by fine-tuning on specific tasks, reducing dependence on annotated
datasets and broadening chemical language representation understanding. This
paper introduces a family of large encoder-decoder chemical foundation models,
pre-trained on a curated dataset of 91 million SMILES samples sourced from
PubChem, equivalent to 4 billion molecular tokens. The proposed foundation
models support different complex tasks, including quantum property prediction,
and offer flexibility through two main variants (289M and 8×289M parameters).
Our experiments across multiple benchmark datasets validate the capacity of
the proposed models to deliver state-of-the-art results across different tasks. We
also provide a preliminary assessment of the compositionality of the embedding
space as a prerequisite for reasoning tasks. We demonstrate that the
produced latent space is separable relative to the state of the art, with
few-shot learning capabilities.
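
As a rough illustration of the downstream workflow the abstract describes (contextual SMILES embeddings from the pre-trained encoder feeding a lightweight property-prediction head), the sketch below assumes a Hugging Face-style encoder-decoder checkpoint. The checkpoint name, the mean-pooling choice, and the property values are placeholders for illustration, not the authors' released interface or data.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import Ridge

# Hypothetical checkpoint identifier; the paper's released model name is not given here.
CHECKPOINT = "your-org/chem-encoder-decoder-289M"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModel.from_pretrained(CHECKPOINT)
encoder = model.get_encoder()  # use only the encoder branch to embed SMILES
encoder.eval()

def embed_smiles(smiles_list):
    """Masked mean-pooling of the encoder's last hidden states, one vector per molecule."""
    batch = tokenizer(smiles_list, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(input_ids=batch["input_ids"],
                         attention_mask=batch["attention_mask"]).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return ((hidden * mask).sum(dim=1) / mask.sum(dim=1)).numpy()

# Frozen-embedding baseline: a linear head for a downstream property-prediction task.
train_smiles = ["CCO", "c1ccccc1", "CC(=O)O"]   # toy molecules
train_labels = [0.21, 1.69, 0.09]               # made-up property values, illustration only
head = Ridge(alpha=1.0).fit(embed_smiles(train_smiles), train_labels)
print(head.predict(embed_smiles(["CCN"])))
```

In practice, the pre-train-then-fine-tune recipe the abstract describes would update the encoder end-to-end on the labeled task rather than relying on a frozen linear head; the sketch only shows the embedding-to-prediction path.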