MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging
November 17, 2025
作者: Siyuan Li, Kai Yu, Anna Wang, Zicheng Liu, Chang Yu, Jingbo Zhou, Qirong Yang, Yucheng Guo, Xiaoming Zhang, Stan Z. Li
cs.AI
Abstract
Modeling genomic sequences faces two unsolved challenges: the information density varies widely across different regions, and there is no clearly defined minimal vocabulary unit. Relying on either the four primitive bases or independently designed DNA tokenizers, existing approaches with naive masked language modeling pre-training often fail to adapt to the varying complexities of genomic sequences. Leveraging Token Merging techniques, this paper introduces a hierarchical architecture that jointly optimizes a dynamic genomic tokenizer and latent Transformers with context-aware pre-training tasks. In terms of network structure, the tokenization module automatically chunks adjacent bases into words by stacking multiple layers of differentiable token merging blocks with local-window constraints, after which a Latent Encoder captures the global context of these merged words with full-attention blocks. Symmetrically employing a Latent Decoder and a Local Decoder, MergeDNA learns with two pre-training tasks: Merged Token Reconstruction simultaneously trains the dynamic tokenization module and adaptively filters important tokens, while Adaptive Masked Token Modeling learns to predict these filtered tokens to capture informative content. Extensive experiments show that MergeDNA achieves superior performance on three popular DNA benchmarks and several multi-omics tasks with fine-tuning or zero-shot evaluation, outperforming typical tokenization methods and large-scale DNA foundation models.
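To make the tokenization idea concrete, below is a minimal sketch of a differentiable, local-window token merging block of the kind the abstract describes: each window of adjacent base embeddings is collapsed into one "word" token via learned softmax weights, and stacking such blocks coarsens the sequence hierarchically. All names here (LocalTokenMerge, window, the scoring head) are illustrative assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalTokenMerge(nn.Module):
    """Merge adjacent tokens inside non-overlapping local windows.

    Each window of `window` tokens is reduced to a single token by a
    softmax-weighted average, so gradients flow through the merge weights
    (a differentiable stand-in for hard chunking). Hypothetical sketch,
    not MergeDNA's actual block.
    """

    def __init__(self, dim: int, window: int = 4):
        super().__init__()
        self.window = window
        self.score = nn.Linear(dim, 1)  # learned per-token importance score

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, dim); pad so length is divisible by the window
        b, n, d = x.shape
        pad = (-n) % self.window
        if pad:
            x = F.pad(x, (0, 0, 0, pad))
        x = x.view(b, -1, self.window, d)   # (b, n_windows, window, dim)
        w = self.score(x).softmax(dim=2)    # soft merge weights within a window
        return (w * x).sum(dim=2)           # (b, n_windows, dim): merged "words"

# Stacking two such blocks coarsens embedded base tokens 16x; a full-attention
# latent encoder would then contextualize the resulting word-level tokens.
tokenizer = nn.Sequential(LocalTokenMerge(64, 4), LocalTokenMerge(64, 4))
bases = torch.randn(2, 1024, 64)   # embedded A/C/G/T sequence
words = tokenizer(bases)           # shape: (2, 64, 64)
```

Because the merge weights are produced by a trainable scoring head, the same weights can double as an importance signal, which is one plausible way the Merged Token Reconstruction objective could select informative tokens for the Adaptive Masked Token Modeling task.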