Universal Biological Sequence Reranking for Improved De Novo Peptide Sequencing
May 23, 2025
Authors: Zijie Qiu, Jiaqi Wei, Xiang Zhang, Sheng Xu, Kai Zou, Zhi Jin, Zhiqiang Gao, Nanqing Dong, Siqi Sun
cs.AI
Abstract
De novo peptide sequencing is a critical task in proteomics. However, the
performance of current deep learning-based methods is limited by the inherent
complexity of mass spectrometry data and the heterogeneous distribution of
noise signals, leading to data-specific biases. We present RankNovo, the first
deep reranking framework that enhances de novo peptide sequencing by leveraging
the complementary strengths of multiple sequencing models. RankNovo employs a
list-wise reranking approach, modeling candidate peptides as multiple sequence
alignments and utilizing axial attention to extract informative features across
candidates. Additionally, we introduce two new metrics, PMD (Peptide Mass
Deviation) and RMD (Residual Mass Deviation), which offer fine-grained supervision
by quantifying mass differences between peptides at both the sequence and
residue levels. Extensive experiments demonstrate that RankNovo not only
surpasses the base models that generated its candidates for reranking
pre-training, but also sets a new state of the art. Moreover,
RankNovo exhibits strong zero-shot generalization to models whose
outputs it never saw during training, highlighting its robustness and
potential as a universal reranking framework for peptide sequencing. Our work
presents a novel reranking strategy that fundamentally challenges existing
single-model paradigms and advances the frontier of accurate de novo
sequencing. Our source code is provided on GitHub.
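
To make the idea of mass differences "at both the sequence and residue levels" concrete, the following is a minimal sketch, not the paper's definition of PMD or RMD, which the abstract does not spell out. It assumes unmodified peptides written as one-letter amino acid strings and uses standard monoisotopic residue masses. The function names `peptide_mass_gap` and `prefix_mass_gaps` are hypothetical and chosen only for illustration.

```python
from itertools import accumulate

# Standard monoisotopic residue masses in daltons (unmodified amino acids).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added to the residue sum for a free peptide


def peptide_mass(seq: str) -> float:
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def peptide_mass_gap(a: str, b: str) -> float:
    """Sequence-level signal (illustrative only): absolute difference
    between the total masses of two candidate peptides."""
    return abs(peptide_mass(a) - peptide_mass(b))


def prefix_mass_gaps(a: str, b: str) -> list[float]:
    """Residue-level signal (illustrative only): absolute difference
    between cumulative prefix masses at each position. The shorter
    peptide is padded by repeating its final prefix mass."""
    pa = list(accumulate(RESIDUE_MASS[aa] for aa in a))
    pb = list(accumulate(RESIDUE_MASS[aa] for aa in b))
    n = max(len(pa), len(pb))
    pa += [pa[-1] if pa else 0.0] * (n - len(pa))
    pb += [pb[-1] if pb else 0.0] * (n - len(pb))
    return [abs(x - y) for x, y in zip(pa, pb)]


if __name__ == "__main__":
    print(peptide_mass_gap("PEPTIDE", "PEPTLDE"))  # I and L are isobaric
    print(prefix_mass_gaps("GA", "AG"))            # differs only at position 1
```

Under these assumptions, two candidates that swap isoleucine for leucine have a sequence-level gap of zero, while a local reordering such as GA versus AG shows up only in the residue-level gaps. This illustrates why a signal at both levels can be more informative than total mass alone.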