Large-Scale Multi-omic Biosequence Transformers for Modeling Peptide-Nucleotide Interactions
August 29, 2024
Authors: Sully F. Chen, Robert J. Steele, Beakal Lemeneh, Shivanand P. Lad, Eric Oermann
cs.AI
Abstract
The transformer architecture has revolutionized bioinformatics and driven progress in the understanding and prediction of the properties of biomolecules. Almost all research on large-scale biosequence transformers has focused on one domain at a time (single-omic), usually nucleotides or peptides. These models have seen incredible success in downstream tasks in each domain and have achieved particularly noteworthy breakthroughs in peptide sequence and structure modeling. However, these single-omic models are naturally incapable of modeling multi-omic tasks, one of the most biologically critical being nucleotide-peptide interactions.

We present our work training the first multi-omic nucleotide-peptide foundation models. We show that these multi-omic models (MOMs) can learn joint representations between various single-omic distributions that are emergently consistent with the Central Dogma of molecular biology, despite only being trained on unlabeled biosequences. We further demonstrate that MOMs can be fine-tuned to achieve state-of-the-art results on peptide-nucleotide interaction tasks, namely predicting the change in Gibbs free energy (ΔG) of the binding interaction between a given oligonucleotide and peptide, as well as the effect of mutations in the oligonucleotide sequence on this binding interaction (ΔΔG).
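The abstract does not define these quantities explicitly; under the standard thermodynamic convention, which matches the description above, ΔG is the binding free energy of the oligonucleotide-peptide complex and ΔΔG compares a mutant oligonucleotide to the wild type. The subscripts below are illustrative labels, not notation from the paper:

```latex
\[
\Delta G = G_{\text{complex}} - \bigl(G_{\text{oligonucleotide}} + G_{\text{peptide}}\bigr),
\qquad
\Delta\Delta G = \Delta G_{\text{mutant}} - \Delta G_{\text{wild-type}}
\]
```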
Remarkably, we show that multi-omic biosequence transformers emergently learn useful structural information without any prior structural training, allowing us to predict which peptide residues are most involved in the peptide-nucleotide binding interaction. Lastly, we provide evidence that multi-omic biosequence models are non-inferior to foundation models trained on single-omic distributions, suggesting a more generalized or foundational approach to building these models.
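The abstract describes fine-tuning a multi-omic model as a ΔG regressor over a paired oligonucleotide and peptide, but gives no architectural details. Below is a minimal PyTorch sketch of that general setup; the toy vocabulary, the `encode_pair` helper, the `MultiOmicRegressor` module, and all hyperparameters are illustrative assumptions rather than the authors' implementation, and pre-training and positional encodings are omitted for brevity.

```python
# Illustrative sketch only: a joint nucleotide-peptide encoder with a ΔG
# regression head. Not the paper's architecture, tokenizer, or hyperparameters.
import torch
import torch.nn as nn

# Shared toy vocabulary: nucleotides, amino acids (prefixed to avoid symbol
# collisions with A/C/G/T), and special tokens.
NUCLEOTIDES = list("ACGT")
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
SPECIALS = ["<pad>", "<cls>", "<sep>"]
VOCAB = {tok: i for i, tok in enumerate(
    SPECIALS + NUCLEOTIDES + [f"aa_{a}" for a in AMINO_ACIDS])}


def encode_pair(oligo: str, peptide: str) -> torch.Tensor:
    """Concatenate an oligonucleotide and a peptide into one token sequence."""
    tokens = ["<cls>"] + list(oligo) + ["<sep>"] + [f"aa_{a}" for a in peptide]
    return torch.tensor([VOCAB[t] for t in tokens], dtype=torch.long)


class MultiOmicRegressor(nn.Module):
    """Transformer encoder over the joint sequence with a ΔG regression head."""

    def __init__(self, vocab_size: int, d_model: int = 128, n_layers: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)  # scalar ΔG prediction

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(self.embed(token_ids))  # (batch, seq, d_model)
        return self.head(hidden[:, 0])  # read out from the <cls> position


# One fine-tuning step on a single (oligonucleotide, peptide, ΔG) example.
model = MultiOmicRegressor(vocab_size=len(VOCAB))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
tokens = encode_pair("ACGTACGT", "MKTAYIAK").unsqueeze(0)  # add batch dim
target_dg = torch.tensor([[-7.5]])  # toy ΔG label
loss = nn.functional.mse_loss(model(tokens), target_dg)
loss.backward()
optimizer.step()
```

In practice the encoder would be initialized from a pre-trained multi-omic checkpoint rather than trained from scratch, and a ΔΔG estimate could be obtained by differencing the model's predictions for the mutant and wild-type oligonucleotide sequences.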