Ankh3: Multi-Task Pretraining with Sequence Denoising and Completion Enhances Protein Representations
May 26, 2025
Authors: Hazem Alsamkary, Mohamed Elshaffei, Mohamed Elkerdawy, Ahmed Elnaggar
cs.AI
Abstract
Protein language models (PLMs) have emerged as powerful tools to detect
complex patterns of protein sequences. However, the capability of PLMs to fully
capture information on protein sequences might be limited by focusing on single
pre-training tasks. Although adding data modalities or supervised objectives
can improve the performance of PLMs, pre-training often remains focused on
denoising corrupted sequences. To push the boundaries of PLMs, our research
investigated a multi-task pre-training strategy. We developed Ankh3, a model
jointly optimized on two objectives: masked language modeling with multiple
masking probabilities and protein sequence completion relying only on protein
sequences as input. This multi-task pre-training demonstrated that PLMs can
learn richer and more generalizable representations solely from protein
sequences. The results demonstrated improved performance in downstream tasks,
such as secondary structure prediction, fluorescence, GB1 fitness, and contact
prediction. The integration of multiple tasks gave the model a more
comprehensive understanding of protein properties, leading to more robust and
accurate predictions.
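To make the two objectives concrete, below is a minimal Python sketch (not the authors' code) of how training examples for an encoder-decoder PLM could be built from raw protein sequences alone: T5-style span denoising at a sampled masking probability, and prefix-to-suffix sequence completion. The masking probabilities, sentinel-token format, span-length heuristic, prefix cut point, and the 50/50 task-mixing ratio are all illustrative assumptions, not values reported in the paper.

import random

MASK_PROBS = [0.15, 0.30, 0.45]  # assumed set of masking probabilities, for illustration only

def make_denoising_example(seq: str) -> tuple[str, str]:
    """Corrupt residues with sentinel tokens (T5-style span denoising) at a sampled rate."""
    p = random.choice(MASK_PROBS)
    src, tgt, sentinel = [], [], 0
    i = 0
    while i < len(seq):
        if random.random() < p:
            # Start a masked span and greedily extend it (assumed span heuristic).
            span = seq[i]
            i += 1
            while i < len(seq) and random.random() < 0.5:
                span += seq[i]
                i += 1
            src.append(f"<extra_id_{sentinel}>")
            tgt.append(f"<extra_id_{sentinel}>" + span)
            sentinel += 1
        else:
            src.append(seq[i])
            i += 1
    return " ".join(src), " ".join(tgt)

def make_completion_example(seq: str, min_prefix: float = 0.3) -> tuple[str, str]:
    """Sequence completion: the encoder sees a prefix, the decoder must generate the suffix."""
    cut = random.randint(int(len(seq) * min_prefix), len(seq) - 1)
    return " ".join(seq[:cut]), " ".join(seq[cut:])

def make_training_example(seq: str) -> tuple[str, str]:
    """Sample one of the two objectives per sequence (mixing ratio assumed, not from the paper)."""
    if random.random() < 0.5:
        return make_denoising_example(seq)
    return make_completion_example(seq)

if __name__ == "__main__":
    random.seed(0)
    print(make_training_example("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))

In this sketch both objectives consume only the raw amino-acid sequence, mirroring the abstract's point that the richer representations are learned without extra data modalities or supervised labels.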