Large-Scale Multi-omic Biosequence Transformers for Modeling Peptide-Nucleotide Interactions
August 29, 2024
Authors: Sully F. Chen, Robert J. Steele, Beakal Lemeneh, Shivanand P. Lad, Eric Oermann
cs.AI
Abstract
The transformer architecture has revolutionized bioinformatics and driven
progress in the understanding and prediction of the properties of biomolecules.
Almost all research on large-scale biosequence transformers has focused on one
domain at a time (single-omic), usually nucleotides or peptides. These models
have seen incredible success in downstream tasks in each domain and have
achieved particularly noteworthy breakthroughs in peptide sequence and
structural modeling. However, these single-omic models are naturally incapable
of modeling multi-omic tasks, one of the most biologically critical being
nucleotide-peptide interactions.
We present our work training the first multi-omic nucleotide-peptide
foundation models. We show that these multi-omic models (MOMs) can learn joint
representations between various single-omic distributions that are emergently
consistent with the Central Dogma of molecular biology, despite only being
trained on unlabeled biosequences. We further demonstrate that MOMs can be
fine-tuned to achieve state-of-the-art results on peptide-nucleotide
interaction tasks, namely predicting the change in Gibbs free energy (ΔG)
of the binding interaction between a given oligonucleotide and peptide, as
well as the effect on this binding interaction due to mutations in the
oligonucleotide sequence (ΔΔG).
Remarkably, we show that multi-omic biosequence transformers emergently learn
useful structural information without any prior structural training, allowing
us to predict which peptide residues are most involved in the
peptide-nucleotide binding interaction. Lastly, we provide evidence that
multi-omic biosequence models are non-inferior to foundation models trained on
single-omics distributions, suggesting a more generalized or foundational
approach to building these models.