Beyond Simple Concatenation: Fairly Assessing PLM Architectures for Multi-Chain Protein-Protein Interactions Prediction

Abstract

Protein-protein interactions (PPIs) are fundamental to numerous cellular processes, and characterizing them is vital for understanding disease mechanisms and guiding drug discovery. While protein language models (PLMs) have demonstrated remarkable success in predicting protein structure and function, their application to sequence-based PPI binding affinity prediction remains relatively underexplored. This gap is often attributed to the scarcity of high-quality, rigorously refined datasets and to the reliance on simple strategies for concatenating protein representations. In this work, we address both limitations. First, we introduce a meticulously curated version of the PPB-Affinity dataset containing 8,207 unique protein-protein interaction entries, obtained by resolving annotation inconsistencies and duplicate entries for multi-chain protein interactions. The dataset is split into training, validation, and test sets under a stringent sequence identity threshold of at most 30%, which minimizes data leakage. Second, we propose and systematically evaluate four architectures for adapting PLMs to PPI binding affinity prediction: embeddings concatenation (EC), sequences concatenation (SC), hierarchical pooling (HP), and pooled attention addition (PAD). Each architecture was assessed under two training methods: full fine-tuning and a lightweight approach that trains ConvBERT heads over frozen PLM features. Comprehensive experiments across multiple leading PLMs (ProtT5, ESM2, Ankh, Ankh2, and ESM3) show that the HP and PAD architectures consistently outperform conventional concatenation methods, with an increase of up to 12% in Spearman correlation. These results highlight the need for sophisticated architectural designs to fully exploit the capabilities of PLMs for nuanced PPI binding affinity prediction.
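Since the reported gains are stated in terms of Spearman correlation, a minimal sketch of that metric may help make the comparison concrete. The code below is an illustration only, not the authors' evaluation code: the function names (`average_ranks`, `pearson`, `spearman`) and the affinity values are hypothetical. It computes Spearman correlation as the Pearson correlation of the ranks, assigning tied values their average rank.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over a run of equal values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman(predicted, measured):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(predicted), average_ranks(measured))


# Hypothetical affinity values for five complexes (illustrative numbers only).
measured = [-12.1, -9.4, -10.8, -7.3, -8.9]
predicted = [-11.0, -10.3, -10.1, -8.0, -8.2]
rho = spearman(predicted, measured)  # about 0.9: one adjacent pair is swapped
```

Because the metric depends only on rank order, it rewards a model for ordering complexes correctly by affinity even when the predicted values are offset or rescaled.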
