HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution
June 27, 2023
Authors: Eric Nguyen, Michael Poli, Marjan Faizi, Armin Thomas, Callum Birch-Sykes, Michael Wornow, Aman Patel, Clayton Rabideau, Stefano Massaroli, Yoshua Bengio, Stefano Ermon, Stephen A. Baccus, Chris Ré
cs.AI
Abstract
Genomic (DNA) sequences encode an enormous amount of information for gene
regulation and protein synthesis. Similar to natural language models,
researchers have proposed foundation models in genomics to learn generalizable
features from unlabeled genome data that can then be fine-tuned for downstream
tasks such as identifying regulatory elements. Due to the quadratic scaling of
attention, previous Transformer-based genomic models have used 512 to 4k tokens
as context (<0.001% of the human genome), significantly limiting the modeling
of long-range interactions in DNA. In addition, these methods rely on
tokenizers to aggregate meaningful DNA units, losing single nucleotide
resolution where subtle genetic variations can completely alter protein
function via single nucleotide polymorphisms (SNPs). Recently, Hyena, a large
language model based on implicit convolutions, was shown to match attention in
quality while allowing longer context lengths and lower time complexity.
Leveraging Hyena's new long-range capabilities, we present HyenaDNA, a genomic
foundation model pretrained on the human reference genome with context lengths
of up to 1 million tokens at the single-nucleotide level, an up to 500x
increase over previous dense attention-based models. HyenaDNA scales
sub-quadratically in sequence length (training up to 160x faster than
Transformer), uses single nucleotide tokens, and has full global context at
each layer. We explore what longer context enables, including the first use of
in-context learning in genomics for simple adaptation to novel tasks without
updating pretrained model weights. On fine-tuned benchmarks from the Nucleotide
Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 17 datasets
using a model with orders of magnitude fewer parameters and less pretraining data. On
the GenomicBenchmarks, HyenaDNA surpasses SotA on all 8 datasets on average by
+9 accuracy points.
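
As a rough illustration of the single-nucleotide tokenization and context-length arithmetic described in the abstract, the following minimal Python sketch maps a DNA string to one token per base and compares the quadratic cost of dense attention with a roughly O(L log L) implicit-convolution operator at a 1-million-token context. The vocabulary and function names here are illustrative assumptions, not the tokenizer or code released with HyenaDNA.

import math

# Hypothetical character-level vocabulary: one token per nucleotide plus an
# unknown symbol. Illustrative only; not the authors' released tokenizer.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}

def tokenize(seq: str) -> list[int]:
    """Map a DNA string to single-nucleotide token ids (no k-mer merging)."""
    return [VOCAB.get(base, VOCAB["N"]) for base in seq.upper()]

assert tokenize("ACGT") == [0, 1, 2, 3]  # one token per base, SNP-level resolution kept

# Back-of-the-envelope scaling: dense attention costs on the order of L^2,
# while an implicit-convolution operator like Hyena is roughly L * log L.
L = 1_000_000
attention_cost = L ** 2
hyena_cost = L * math.log2(L)
print(f"attention / hyena cost ratio at L={L:,}: {attention_cost / hyena_cost:,.0f}x")

This sketch only conveys why a 1M-token context is out of reach for dense attention but tractable for a sub-quadratic operator; the 160x training speedup quoted in the abstract is an empirical measurement from the paper, not something this arithmetic reproduces.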