MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging

November 17, 2025
Authors: Siyuan Li, Kai Yu, Anna Wang, Zicheng Liu, Chang Yu, Jingbo Zhou, Qirong Yang, Yucheng Guo, Xiaoming Zhang, Stan Z. Li
cs.AI

Abstract

Modeling genomic sequences faces two unsolved challenges: information density varies widely across regions, and there is no clearly defined minimum vocabulary unit. Existing approaches rely on either the four primitive bases or independently designed DNA tokenizers and, with naive masked language modeling pre-training, often fail to adapt to the varying complexity of genomic sequences. Leveraging Token Merging techniques, this paper introduces a hierarchical architecture that jointly optimizes a dynamic genomic tokenizer and latent Transformers with context-aware pre-training tasks. In terms of network structure, the tokenization module automatically chunks adjacent bases into words by stacking multiple layers of differentiable token merging blocks with local-window constraints; a Latent Encoder then captures the global context of these merged words with full-attention blocks. Symmetrically employing a Latent Decoder and a Local Decoder, MergeDNA learns with two pre-training tasks: Merged Token Reconstruction simultaneously trains the dynamic tokenization module and adaptively filters important tokens, while Adaptive Masked Token Modeling learns to predict these filtered tokens to capture informative content. Extensive experiments show that MergeDNA achieves superior performance on three popular DNA benchmarks and several multi-omics tasks under fine-tuning or zero-shot evaluation, outperforming typical tokenization methods and large-scale DNA foundation models.
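To make the idea of chunking adjacent bases inside local windows more concrete, here is a toy sketch in plain Python. It is not the authors' implementation and it is not differentiable: it assumes fixed one-hot base vectors, non-overlapping windows, cosine similarity as the merge score, and one merge per window per layer. The names `BASE_TO_VEC`, `merge_once`, and `tokenize` are hypothetical and chosen only for illustration.

```python
import math

# Assumption: fixed one-hot vectors stand in for learned base embeddings.
BASE_TO_VEC = {
    "A": (1.0, 0.0, 0.0, 0.0),
    "C": (0.0, 1.0, 0.0, 0.0),
    "G": (0.0, 0.0, 1.0, 0.0),
    "T": (0.0, 0.0, 0.0, 1.0),
}


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def merge_once(tokens, window=4):
    """One toy merging layer.

    tokens: list of (text, vector, size) triples.
    Inside each non-overlapping window, the most similar adjacent pair
    is merged into one token whose vector is the size-weighted average.
    """
    merged = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        if len(chunk) < 2:
            merged.extend(chunk)  # nothing to merge in a 1-token window
            continue
        # Index of the most similar adjacent pair (first one wins ties).
        best = max(
            range(len(chunk) - 1),
            key=lambda i: cosine(chunk[i][1], chunk[i + 1][1]),
        )
        (t1, v1, n1), (t2, v2, n2) = chunk[best], chunk[best + 1]
        vec = tuple((a * n1 + b * n2) / (n1 + n2) for a, b in zip(v1, v2))
        merged.extend(chunk[:best])
        merged.append((t1 + t2, vec, n1 + n2))
        merged.extend(chunk[best + 2:])
    return merged


def tokenize(seq, layers=2, window=4):
    """Stack several toy merging layers and return the merged 'words'."""
    tokens = [(base, BASE_TO_VEC[base], 1) for base in seq]
    for _ in range(layers):
        tokens = merge_once(tokens, window)
    return [text for text, _, _ in tokens]


if __name__ == "__main__":
    print(tokenize("AACGTTGA", layers=1, window=4))
```

Because merges only ever join neighbors inside a window, concatenating the output tokens always reproduces the input sequence, and each layer shortens the sequence by at most one token per window. In the paper's setting the merge scores would come from learned, trainable blocks rather than the fixed one-hot vectors assumed here.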