

HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution

June 27, 2023
Authors: Eric Nguyen, Michael Poli, Marjan Faizi, Armin Thomas, Callum Birch-Sykes, Michael Wornow, Aman Patel, Clayton Rabideau, Stefano Massaroli, Yoshua Bengio, Stefano Ermon, Stephen A. Baccus, Chris Ré
cs.AI

Abstract

Genomic (DNA) sequences encode an enormous amount of information for gene regulation and protein synthesis. Similar to natural language models, researchers have proposed foundation models in genomics to learn generalizable features from unlabeled genome data that can then be fine-tuned for downstream tasks such as identifying regulatory elements. Due to the quadratic scaling of attention, previous Transformer-based genomic models have used 512 to 4k tokens as context (<0.001% of the human genome), significantly limiting the modeling of long-range interactions in DNA. In addition, these methods rely on tokenizers to aggregate meaningful DNA units, losing single nucleotide resolution, where subtle genetic variations can completely alter protein function via single nucleotide polymorphisms (SNPs). Recently, Hyena, a large language model based on implicit convolutions, was shown to match attention in quality while allowing longer context lengths and lower time complexity. Leveraging Hyena's new long-range capabilities, we present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single-nucleotide level, an up to 500x increase over previous dense attention-based models. HyenaDNA scales sub-quadratically in sequence length (training up to 160x faster than a Transformer), uses single nucleotide tokens, and has full global context at each layer. We explore what longer context enables, including the first use of in-context learning in genomics for simple adaptation to novel tasks without updating pretrained model weights. On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 17 datasets using a model with orders of magnitude fewer parameters and less pretraining data. On the GenomicBenchmarks, HyenaDNA surpasses the SotA on all 8 datasets by an average of +9 accuracy points.
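To make the tokenization point concrete, the minimal sketch below (not the authors' code; the function names and vocabulary mapping are hypothetical) contrasts k-mer tokenization, which aggregates several bases into one token so a single-nucleotide change alters the entire token, with the character-level tokenization HyenaDNA uses, which keeps one token per base and so preserves single-nucleotide resolution.

```python
# Minimal, illustrative sketch under assumed names; not the paper's tokenizer.

def kmer_tokenize(seq: str, k: int = 6) -> list[str]:
    """Group the sequence into non-overlapping k-mers (coarse units)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

def single_nucleotide_tokenize(seq: str) -> list[int]:
    """Map each base to its own token id, preserving single-nucleotide resolution."""
    vocab = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}  # hypothetical vocabulary
    return [vocab[base] for base in seq.upper()]

seq = "ACGTTAGCACGT"
print(kmer_tokenize(seq))               # ['ACGTTA', 'GCACGT']  -- one SNP changes a whole k-mer token
print(single_nucleotide_tokenize(seq))  # [0, 1, 2, 3, 3, 0, 2, 1, 0, 1, 2, 3]
```

Single-nucleotide tokens also lengthen sequences by roughly the k-mer factor, which is why the abstract pairs them with Hyena's sub-quadratic scaling rather than dense attention.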