Ankh3: Multi-Task Pretraining with Sequence Denoising and Completion Enhances Protein Representations
May 26, 2025
Authors: Hazem Alsamkary, Mohamed Elshaffei, Mohamed Elkerdawy, Ahmed Elnaggar
cs.AI
Abstract
Protein language models (PLMs) have emerged as powerful tools for detecting complex patterns in protein sequences. However, the ability of PLMs to fully capture the information in protein sequences may be limited by a focus on a single pre-training task. Although adding data modalities or supervised objectives can improve PLM performance, pre-training often remains focused on denoising corrupted sequences. To push the boundaries of PLMs, our research investigated a multi-task pre-training strategy. We developed Ankh3, a model jointly optimized on two objectives: masked language modeling with multiple masking probabilities, and protein sequence completion relying only on protein sequences as input. This multi-task pre-training demonstrated that PLMs can learn richer and more generalizable representations solely from protein sequences. The results showed improved performance on downstream tasks such as secondary structure prediction, fluorescence, GB1 fitness, and contact prediction. The integration of multiple tasks gave the model a more comprehensive understanding of protein properties, leading to more robust and accurate predictions.