Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling

March 5, 2024
Authors: Yair Schiff, Chia-Hsiang Kao, Aaron Gokaslan, Tri Dao, Albert Gu, Volodymyr Kuleshov
cs.AI

Abstract

Large-scale sequence modeling has sparked rapid advances that now extend into biology and genomics. However, modeling genomic sequences introduces challenges such as the need to model long-range token interactions, the effects of upstream and downstream regions of the genome, and the reverse complementarity (RC) of DNA. Here, we propose an architecture motivated by these challenges that builds off the long-range Mamba block, and extends it to a BiMamba component that supports bi-directionality, and to a MambaDNA block that additionally supports RC equivariance. We use MambaDNA as the basis of Caduceus, the first family of RC equivariant bi-directional long-range DNA language models, and we introduce pre-training and fine-tuning strategies that yield Caduceus DNA foundation models. Caduceus outperforms previous long-range models on downstream benchmarks; on a challenging long-range variant effect prediction task, Caduceus exceeds the performance of 10x larger models that do not leverage bi-directionality or equivariance.
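
To make the reverse complementarity (RC) property concrete, here is a minimal Python sketch. It is not the MambaDNA block, whose internals the abstract does not describe. It only illustrates what RC equivariance means, using a generic symmetrization trick. All names (`reverse_complement`, `causal_toy_layer`, `rc_equivariant`) and the toy weights are hypothetical and chosen for illustration.

```python
import math
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse the strand and swap each base with its complement."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def one_hot(seq):
    """Encode a DNA string as rows of length 4, channel order A, C, G, T."""
    return [[1.0 if base == b else 0.0 for b in "ACGT"] for base in seq]


def rc(x):
    """RC on arrays: reverse positions and reverse channels.

    With channel order A, C, G, T, reversing channels swaps A<->T and C<->G.
    """
    return [row[::-1] for row in reversed(x)]


def causal_toy_layer(x, w):
    """Hypothetical left-to-right layer: running sum, linear map, tanh."""
    out, running = [], [0.0] * 4
    for row in x:
        running = [r + v for r, v in zip(running, row)]
        out.append([
            math.tanh(sum(running[i] * w[i][j] for i in range(4)))
            for j in range(4)
        ])
    return out


def rc_equivariant(layer, x, w):
    """Symmetrize any layer: h(x) = layer(x) + RC(layer(RC(x))).

    By construction h(RC(x)) == RC(h(x)), which is RC equivariance.
    """
    forward = layer(x, w)
    backward = rc(layer(rc(x), w))
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(forward, backward)]


random.seed(0)
weights = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(4)]
x = one_hot("AACGTTGC")

lhs = rc_equivariant(causal_toy_layer, rc(x), weights)   # h(RC(x))
rhs = rc(rc_equivariant(causal_toy_layer, x, weights))   # RC(h(x))
is_equivariant = all(
    math.isclose(a, b, abs_tol=1e-12)
    for r1, r2 in zip(lhs, rhs)
    for a, b in zip(r1, r2)
)
```

Because the symmetrized layer also runs the causal layer over the reversed sequence, each output position in this sketch depends on both upstream and downstream bases, which gives a rough intuition for why bi-directionality and RC handling are discussed together.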