Ankh3: Multi-Task Pretraining with Sequence Denoising and Completion Enhances Protein Representations

Abstract

Protein language models (PLMs) have emerged as powerful tools for detecting complex patterns in protein sequences. However, the ability of PLMs to fully capture the information in protein sequences may be limited by a focus on a single pre-training task. Although adding data modalities or supervised objectives can improve the performance of PLMs, pre-training often remains focused on denoising corrupted sequences. To push the boundaries of PLMs, we investigated a multi-task pre-training strategy. We developed Ankh3, a model jointly optimized on two objectives: masked language modeling with multiple masking probabilities and protein sequence completion relying only on protein sequences as input. This multi-task pre-training showed that PLMs can learn richer and more generalizable representations solely from protein sequences. Performance improved on downstream tasks such as secondary structure prediction, fluorescence, GB1 fitness, and contact prediction. Integrating multiple tasks gave the model a more comprehensive understanding of protein properties, leading to more robust and accurate predictions.
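To make the two objectives concrete, the following is a minimal sketch of how training examples for them might be built from a raw sequence. It is not the authors' implementation. The abstract does not specify the masking probabilities, the mask token, how the sequence is split for completion, or how the two tasks are mixed, so `MASK`, `mask_probs`, `split_frac`, and `completion_rate` are all hypothetical, as are the function names.

```python
import random

MASK = "<mask>"  # hypothetical mask token; the real vocabulary is not given in the abstract


def mask_sequence(seq, mask_prob, rng):
    """Masked language modeling: hide each residue independently with probability mask_prob.

    Returns the corrupted token list and a target list that holds the original
    residue at masked positions and None elsewhere.
    """
    corrupted, targets = [], []
    for residue in seq:
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets.append(residue)
        else:
            corrupted.append(residue)
            targets.append(None)
    return corrupted, targets


def completion_example(seq, split_frac=0.5):
    """Sequence completion: the model sees a prefix and must generate the rest.

    Assumes len(seq) >= 2 so that both parts are non-empty.
    """
    cut = max(1, min(len(seq) - 1, int(len(seq) * split_frac)))
    return seq[:cut], seq[cut:]


def make_example(seq, rng, mask_probs=(0.15, 0.3), completion_rate=0.5):
    """Pick one of the two objectives at random for this sequence (assumed mixing scheme)."""
    if rng.random() < completion_rate:
        prefix, target = completion_example(seq)
        return {"task": "completion", "input": prefix, "target": target}
    mask_prob = rng.choice(mask_probs)  # one of several masking probabilities
    corrupted, targets = mask_sequence(seq, mask_prob, rng)
    return {"task": "mlm", "mask_prob": mask_prob, "input": corrupted, "target": targets}
```

In this sketch, both objectives consume only the amino-acid sequence, which mirrors the abstract's point that no additional modality or supervised label is needed during pre-training.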
