Ankh3: Multi-Task Pretraining with Sequence Denoising and Completion Enhances Protein Representations

Abstract

Protein language models (PLMs) have emerged as powerful tools for detecting complex patterns in protein sequences. However, focusing on a single pre-training task may limit how fully PLMs capture the information in those sequences. Although adding data modalities or supervised objectives can improve PLM performance, pre-training often remains focused on denoising corrupted sequences. To push the boundaries of PLMs, we investigated a multi-task pre-training strategy. We developed Ankh3, a model jointly optimized on two objectives: masked language modeling with multiple masking probabilities, and protein sequence completion that relies only on protein sequences as input. This multi-task pre-training demonstrated that PLMs can learn richer and more generalizable representations solely from protein sequences. Downstream performance improved on tasks such as secondary structure prediction, fluorescence, GB1 fitness, and contact prediction. Integrating multiple tasks gave the model a more comprehensive understanding of protein properties, leading to more robust and accurate predictions.
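To make the two objectives concrete, the following is a minimal sketch of how training examples for masked language modeling at several masking probabilities and for sequence completion might be built from a raw protein sequence. It is not the authors' implementation. The mask token, the masking probabilities (0.15, 0.2, 0.3), the half-way split point, the 50/50 task mix, and the names `mask_sequence`, `completion_example`, and `sample_task` are all hypothetical choices made for illustration; the abstract does not specify any of them.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol; the real tokenizer's token may differ


def mask_sequence(seq, mask_prob, rng):
    """Build one masked-language-modeling example.

    Returns (corrupted, targets), where targets[i] is the original residue
    if position i was masked and None otherwise.
    """
    corrupted, targets = [], []
    for residue in seq:
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets.append(residue)
        else:
            corrupted.append(residue)
            targets.append(None)
    return corrupted, targets


def completion_example(seq, split_frac=0.5):
    """Build one sequence-completion example.

    Returns (prefix, suffix): the model would see the prefix and be trained
    to generate the suffix. Assumes len(seq) >= 2 for a non-empty suffix.
    """
    cut = max(1, min(len(seq) - 1, int(len(seq) * split_frac)))
    return seq[:cut], seq[cut:]


def sample_task(seq, rng, mask_probs=(0.15, 0.2, 0.3)):
    """Pick one of the two objectives at random (assumed 50/50 mix)."""
    if rng.random() < 0.5:
        p = rng.choice(mask_probs)  # illustrative probabilities, not from the source
        corrupted, targets = mask_sequence(seq, p, rng)
        return {"task": "mlm", "mask_prob": p, "input": corrupted, "target": targets}
    prefix, suffix = completion_example(seq)
    return {"task": "completion", "input": list(prefix), "target": list(suffix)}
```

Calling `sample_task("MKTAYIAKQRQISFVKSHFSRQ", random.Random(0))` yields a dictionary describing either a masked copy of the sequence with its per-position targets or a prefix/suffix pair. In both cases the only input is the protein sequence itself, which is the property the abstract emphasizes.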
