Large-Scale Multi-omic Biosequence Transformers for Modeling Peptide-Nucleotide Interactions

Abstract

The transformer architecture has revolutionized bioinformatics and driven progress in understanding and predicting the properties of biomolecules. Almost all research on large-scale biosequence transformers has focused on one domain at a time (single-omic), usually nucleotides or peptides. These models have seen remarkable success in downstream tasks within each domain and have achieved particularly noteworthy breakthroughs in peptide sequence and structure modeling. However, single-omic models are inherently incapable of modeling multi-omic tasks, one of the most biologically critical being nucleotide-peptide interactions. We present our work training the first multi-omic nucleotide-peptide foundation models. We show that these multi-omic models (MOMs) can learn joint representations between various single-omic distributions that are emergently consistent with the Central Dogma of molecular biology, despite being trained only on unlabeled biosequences. We further demonstrate that MOMs can be fine-tuned to achieve state-of-the-art results on peptide-nucleotide interaction tasks, namely predicting the change in Gibbs free energy (ΔG) of the binding interaction between a given oligonucleotide and peptide, as well as the effect on this binding interaction of mutations in the oligonucleotide sequence (ΔΔG). Remarkably, we show that multi-omic biosequence transformers emergently learn useful structural information without any prior structural training, allowing us to predict which peptide residues are most involved in the peptide-nucleotide binding interaction. Lastly, we provide evidence that multi-omic biosequence models are non-inferior to foundation models trained on single-omic distributions, suggesting a more generalized or foundational approach to building these models.
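
For readers unfamiliar with the two prediction targets, the following is a minimal sketch of how ΔG and ΔΔG relate to each other. It is not taken from the paper: it uses the standard thermodynamic relation ΔG = RT ln(Kd) with a 1 M reference state, and it assumes the sign convention ΔΔG = ΔG(mutant) − ΔG(wild type), which the abstract does not specify. The function names and the example dissociation constants are hypothetical.

```python
import math

# Gas constant in kcal/(mol*K).
R_KCAL = 1.987204e-3


def binding_delta_g(kd_molar: float, temperature_k: float = 298.15) -> float:
    """Standard binding free energy in kcal/mol from a dissociation constant.

    Uses dG = R * T * ln(Kd / 1 M); tighter binding (smaller Kd) gives a
    more negative value.
    """
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL * temperature_k * math.log(kd_molar)


def delta_delta_g(dg_wild_type: float, dg_mutant: float) -> float:
    """Change in binding free energy caused by a mutation.

    Assumed convention: mutant minus wild type, so a positive value means
    the mutation weakens binding.
    """
    return dg_mutant - dg_wild_type


# Hypothetical example: a mutation raises Kd from 1 nM to 100 nM.
dg_wt = binding_delta_g(1e-9)
dg_mut = binding_delta_g(1e-7)
ddg = delta_delta_g(dg_wt, dg_mut)
```

Under this convention, a hundredfold loss of affinity at 298.15 K corresponds to a ΔΔG of roughly +2.7 kcal/mol.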
