Ankh3: Multi-Task Pretraining with Sequence Denoising and Completion Enhances Protein Representations
May 26, 2025
Authors: Hazem Alsamkary, Mohamed Elshaffei, Mohamed Elkerdawy, Ahmed Elnaggar
cs.AI
Abstract
Protein language models (PLMs) have emerged as powerful tools to detect
complex patterns of protein sequences. However, the capability of PLMs to fully
capture information on protein sequences might be limited by focusing on single
pre-training tasks. Although adding data modalities or supervised objectives
can improve the performance of PLMs, pre-training often remains focused on
denoising corrupted sequences. To push the boundaries of PLMs, our research
investigated a multi-task pre-training strategy. We developed Ankh3, a model
jointly optimized on two objectives: masked language modeling with multiple
masking probabilities and protein sequence completion relying only on protein
sequences as input. This multi-task pre-training demonstrated that PLMs can
learn richer and more generalizable representations solely from protein
sequences. The results demonstrated improved performance in downstream tasks,
such as secondary structure prediction, fluorescence, GB1 fitness, and contact
prediction. The integration of multiple tasks gave the model a more
comprehensive understanding of protein properties, leading to more robust and
accurate predictions.
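To make the two objectives concrete, below is a minimal Python sketch (not the authors' code) of how training examples for an encoder-decoder PLM could be built from raw protein sequences alone: T5-style span denoising at a sampled masking probability, and prefix-to-suffix sequence completion. The masking probabilities, sentinel-token format, span-length heuristic, prefix cut point, and the 50/50 task-mixing ratio are all illustrative assumptions, not values reported in the paper.

import random

MASK_PROBS = [0.15, 0.30, 0.45]  # assumed set of masking probabilities, for illustration only

def make_denoising_example(seq: str) -> tuple[str, str]:
    """Corrupt residues with sentinel tokens (T5-style span denoising) at a sampled rate."""
    p = random.choice(MASK_PROBS)
    src, tgt, sentinel = [], [], 0
    i = 0
    while i < len(seq):
        if random.random() < p:
            # Start a masked span and greedily extend it (assumed span heuristic).
            span = seq[i]
            i += 1
            while i < len(seq) and random.random() < 0.5:
                span += seq[i]
                i += 1
            src.append(f"<extra_id_{sentinel}>")
            tgt.append(f"<extra_id_{sentinel}>" + span)
            sentinel += 1
        else:
            src.append(seq[i])
            i += 1
    return " ".join(src), " ".join(tgt)

def make_completion_example(seq: str, min_prefix: float = 0.3) -> tuple[str, str]:
    """Sequence completion: the encoder sees a prefix, the decoder must generate the suffix."""
    cut = random.randint(int(len(seq) * min_prefix), len(seq) - 1)
    return " ".join(seq[:cut]), " ".join(seq[cut:])

def make_training_example(seq: str) -> tuple[str, str]:
    """Sample one of the two objectives per sequence (mixing ratio assumed, not from the paper)."""
    if random.random() < 0.5:
        return make_denoising_example(seq)
    return make_completion_example(seq)

if __name__ == "__main__":
    random.seed(0)
    print(make_training_example("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))

In this sketch both objectives consume only the raw amino-acid sequence, mirroring the abstract's point that the richer representations are learned without extra data modalities or supervised labels.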