Prot2Token: A Unified Framework for Protein Modeling via Next-Token Prediction

Abstract

The diversity of protein prediction tasks has traditionally necessitated specialized models, hindering the development of broadly applicable and computationally efficient Protein Language Models (PLMs). In this work, we introduce Prot2Token, a unified framework that overcomes these challenges by converting a wide spectrum of protein-related predictions, from sequence-level properties and residue-specific attributes to complex inter-protein interactions, into a standardized next-token prediction format. At its core, Prot2Token employs an autoregressive decoder, conditioned on embeddings from pre-trained protein encoders and guided by learnable task tokens, to perform diverse predictions. This architecture uniquely facilitates multi-task learning, enabling a single model to master numerous tasks with improved efficiency. We present extensive experimental validation across a variety of benchmarks, demonstrating Prot2Token's strong predictive power across different types of protein-prediction tasks. Key results include significant speedups (e.g., nearly 1000x over AlphaFold2 with MSA) and performance that often matches or exceeds specialized approaches. Beyond that, we introduce an auxiliary self-supervised decoder pre-training approach to improve performance on spatially sensitive tasks. Prot2Token thus offers a significant step towards a versatile, high-throughput paradigm for protein modeling, promising to accelerate biological discovery and the development of novel therapeutics. The code is available at https://github.com/mahdip72/prot2token.
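To make the idea of a "standardized next-token prediction format" concrete, the following is a minimal sketch of how heterogeneous labels might be serialized into token sequences that a single autoregressive decoder could be trained on. It is an illustration under stated assumptions, not the tokenization used in Prot2Token: the special tokens, the task names (`localization`, `binding`, `stability`), the position-index encoding for residue-level labels, and the character-level encoding for scalar targets are all hypothetical. The protein encoder embeddings that condition the decoder are omitted entirely.

```python
from typing import List, Sequence, Tuple

# Hypothetical special tokens; the real vocabulary is not given in the abstract.
BOS, EOS = "<bos>", "<eos>"


def task_token(task: str) -> str:
    """Hypothetical task prompt token that tells the decoder which task to solve."""
    return f"<task_{task}>"


def encode_classification(task: str, label: str) -> List[str]:
    """Sequence-level property: a single label token."""
    return [BOS, task_token(task), label, EOS]


def encode_residue_sites(task: str, mask: Sequence[int]) -> List[str]:
    """Residue-level attribute: 1-based indices of the positive residues."""
    positions = [str(i + 1) for i, flag in enumerate(mask) if flag]
    return [BOS, task_token(task), *positions, EOS]


def encode_regression(task: str, value: float, digits: int = 2) -> List[str]:
    """Scalar target: a sign token, then one token per character of the
    fixed-precision decimal string (an assumed encoding)."""
    sign = "-" if value < 0 else "+"
    text = f"{abs(value):.{digits}f}"
    return [BOS, task_token(task), sign, *text, EOS]


def next_token_pairs(
    tokens: List[str], prompt_len: int = 2
) -> List[Tuple[List[str], str]]:
    """Teacher-forcing pairs (prefix, target). The first `prompt_len` tokens
    (BOS and the task token) serve only as context and are never targets."""
    return [(tokens[:i], tokens[i]) for i in range(prompt_len, len(tokens))]


if __name__ == "__main__":
    for seq in (
        encode_classification("localization", "nucleus"),
        encode_residue_sites("binding", [0, 1, 0, 1]),
        encode_regression("stability", -1.5),
    ):
        print(seq)
        for prefix, target in next_token_pairs(seq):
            print("   ", prefix, "->", target)
```

Under this sketch, every task reduces to the same training signal (predict the next token given the prefix), which is what would allow one decoder to be shared across tasks and switched between them by the task token alone.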
