
Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling

March 5, 2024
Authors: Yair Schiff, Chia-Hsiang Kao, Aaron Gokaslan, Tri Dao, Albert Gu, Volodymyr Kuleshov
cs.AI

Abstract

Large-scale sequence modeling has sparked rapid advances that now extend into biology and genomics. However, modeling genomic sequences introduces challenges such as the need to model long-range token interactions, the effects of upstream and downstream regions of the genome, and the reverse complementarity (RC) of DNA. Here, we propose an architecture motivated by these challenges that builds on the long-range Mamba block, extending it to a BiMamba component that supports bi-directionality and to a MambaDNA block that additionally supports RC equivariance. We use MambaDNA as the basis of Caduceus, the first family of RC equivariant bi-directional long-range DNA language models, and we introduce pre-training and fine-tuning strategies that yield Caduceus DNA foundation models. Caduceus outperforms previous long-range models on downstream benchmarks; on a challenging long-range variant effect prediction task, Caduceus exceeds the performance of 10x larger models that do not leverage bi-directionality or equivariance.
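
For readers unfamiliar with the term, the reverse complement of a DNA strand and a simplified notion of RC equivariance can be illustrated with a short Python sketch. This is not the Caduceus or MambaDNA implementation: the function names (`reverse_complement`, `gc_flags`, `is_a_flags`, `is_rc_equivariant`) are hypothetical, and the toy "models" are per-position scoring functions chosen only to show the property that reverse-complementing the input reverses the output.

```python
# Illustrative sketch only; names and toy models are assumptions, not the paper's code.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Reverse the sequence and swap each base for its complement (A<->T, C<->G)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def gc_flags(seq):
    """Toy per-position model: 1 where the base is G or C, else 0."""
    return [1 if base in "GC" else 0 for base in seq.upper()]


def is_a_flags(seq):
    """Toy per-position model that is NOT RC equivariant: 1 where the base is A."""
    return [1 if base == "A" else 0 for base in seq.upper()]


def is_rc_equivariant(model, seq):
    """Check, for one input, that model(RC(seq)) equals model(seq) reversed."""
    return model(reverse_complement(seq)) == list(reversed(model(seq)))


if __name__ == "__main__":
    seq = "AACGT"
    print(reverse_complement(seq))
    print(is_rc_equivariant(gc_flags, seq))
    print(is_rc_equivariant(is_a_flags, "AAC"))
```

Because G and C are complements of each other, `gc_flags` gives the same per-position scores on either strand, only in reversed order, whereas `is_a_flags` treats the two strands differently. The abstract's claim is that building this symmetry into the architecture, rather than leaving the model to learn it from data, helps on downstream tasks.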