Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling
March 5, 2024
Authors: Yair Schiff, Chia-Hsiang Kao, Aaron Gokaslan, Tri Dao, Albert Gu, Volodymyr Kuleshov
cs.AI
Abstract
Large-scale sequence modeling has sparked rapid advances that now extend into
biology and genomics. However, modeling genomic sequences introduces challenges
such as the need to model long-range token interactions, the effects of
upstream and downstream regions of the genome, and the reverse complementarity
(RC) of DNA. Here, we propose an architecture motivated by these challenges
that builds on the long-range Mamba block, extending it to a BiMamba
component that supports bi-directionality and to a MambaDNA block that
additionally supports RC equivariance. We use MambaDNA as the basis of
Caduceus, the first family of RC equivariant bi-directional long-range DNA
language models, and we introduce pre-training and fine-tuning strategies that
yield Caduceus DNA foundation models. Caduceus outperforms previous long-range
models on downstream benchmarks; on a challenging long-range variant effect
prediction task, Caduceus exceeds the performance of 10x larger models that do
not leverage bi-directionality or equivariance.
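
To make the reverse-complement (RC) property concrete, here is a minimal Python sketch of what RC equivariance means for a per-position model. It is not the Caduceus or MambaDNA implementation. The toy "models" gc_flags and purine_flags and the helper is_rc_equivariant are hypothetical names introduced only for illustration, and outputs are assumed to be one scalar per base, so the RC of an output is simply its reversal.

```python
from typing import Callable, List

# Illustrative sketch only; not the Caduceus implementation.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Reverse the sequence and swap each base for its complement."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def gc_flags(seq: str) -> List[int]:
    """Hypothetical toy model: 1 where the base is G or C, else 0."""
    return [1 if base in "GC" else 0 for base in seq]


def purine_flags(seq: str) -> List[int]:
    """Hypothetical toy model: 1 where the base is A or G, else 0."""
    return [1 if base in "AG" else 0 for base in seq]


def is_rc_equivariant(model: Callable[[str], List[int]], seq: str) -> bool:
    """Check model(RC(x)) == reverse(model(x)) for a single sequence.

    Assumes one scalar output per position, so applying RC to the
    output reduces to reversing it along the sequence axis.
    """
    return model(reverse_complement(seq)) == list(reversed(model(seq)))


if __name__ == "__main__":
    seq = "AACGT"
    print(reverse_complement(seq))
    print(is_rc_equivariant(gc_flags, seq))
    print(is_rc_equivariant(purine_flags, "AAC"))
```

In this sketch gc_flags passes the check because G and C are complements of each other, while purine_flags fails because complementing turns purines into pyrimidines. An RC equivariant architecture is built so that the check holds by construction rather than having to be learned from data.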