

Beyond Simple Concatenation: Fairly Assessing PLM Architectures for Multi-Chain Protein-Protein Interactions Prediction

May 26, 2025
作者: Hazem Alsamkary, Mohamed Elshaffei, Mohamed Soudy, Sara Ossman, Abdallah Amr, Nehal Adel Abdelsalam, Mohamed Elkerdawy, Ahmed Elnaggar
cs.AI

Abstract

Protein-protein interactions (PPIs) are fundamental to numerous cellular processes, and their characterization is vital for understanding disease mechanisms and guiding drug discovery. While protein language models (PLMs) have demonstrated remarkable success in predicting protein structure and function, their application to sequence-based PPI binding affinity prediction remains relatively underexplored. This gap is often attributed to the scarcity of high-quality, rigorously refined datasets and to the reliance on simple strategies for concatenating protein representations. In this work, we address both limitations. First, we introduce a meticulously curated version of the PPB-Affinity dataset comprising 8,207 unique protein-protein interaction entries, produced by resolving annotation inconsistencies and duplicate entries for multi-chain protein interactions. The dataset is split into training, validation, and test sets under a stringent sequence identity threshold (≤30%) to minimize data leakage. Second, we propose and systematically evaluate four architectures for adapting PLMs to PPI binding affinity prediction: embeddings concatenation (EC), sequences concatenation (SC), hierarchical pooling (HP), and pooled attention addition (PAD). These architectures were assessed under two training regimes: full fine-tuning and a lightweight approach that trains ConvBERT heads over frozen PLM features. Comprehensive experiments across multiple leading PLMs (ProtT5, ESM2, Ankh, Ankh2, and ESM3) show that the HP and PAD architectures consistently outperform conventional concatenation methods, achieving up to a 12% increase in Spearman correlation. These results highlight the need for sophisticated architectural designs to fully exploit PLMs for nuanced PPI binding affinity prediction.
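The abstract does not specify how the four adaptation heads are wired. The following is a minimal PyTorch sketch of how such heads might look over per-residue embeddings from a frozen PLM; all module names, the mean-pooling choice, and the exact attention wiring for PAD are assumptions inferred from the architecture names alone, not the authors' implementation.

```python
import torch
import torch.nn as nn


class AffinityHeads(nn.Module):
    """Hypothetical regression heads for PPI binding affinity.

    h_a, h_b: per-residue embeddings of the two chains, shape (B, L, d),
    as produced by a frozen PLM. h_ab is the joint embedding when the two
    sequences are concatenated *before* the PLM (the SC setting).
    """

    def __init__(self, d_model: int, num_heads: int = 8):
        super().__init__()
        # d_model must be divisible by num_heads for MultiheadAttention.
        self.reg_ec = nn.Linear(2 * d_model, 1)  # EC uses a doubled width
        self.reg = nn.Linear(d_model, 1)         # shared by SC / HP / PAD
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    @staticmethod
    def mean_pool(x: torch.Tensor) -> torch.Tensor:
        # (B, L, d) -> (B, d); a simple stand-in for the paper's pooling.
        return x.mean(dim=1)

    def embeddings_concatenation(self, h_a, h_b):
        # EC: pool each chain independently, then concatenate the summaries.
        z = torch.cat([self.mean_pool(h_a), self.mean_pool(h_b)], dim=-1)
        return self.reg_ec(z)

    def sequences_concatenation(self, h_ab):
        # SC: the chains were already concatenated at the sequence level,
        # so the head only pools the joint representation.
        return self.reg(self.mean_pool(h_ab))

    def hierarchical_pooling(self, h_a, h_b):
        # HP: pool within each chain, then pool again across chains.
        z = torch.stack([self.mean_pool(h_a), self.mean_pool(h_b)], dim=1)
        return self.reg(z.mean(dim=1))

    def pooled_attention_addition(self, h_a, h_b):
        # PAD (one plausible reading): each pooled chain summary attends
        # over the other chain's residues; the attended outputs are added.
        q_a = self.mean_pool(h_a).unsqueeze(1)  # (B, 1, d)
        q_b = self.mean_pool(h_b).unsqueeze(1)
        a, _ = self.attn(q_a, h_b, h_b)
        b, _ = self.attn(q_b, h_a, h_a)
        return self.reg((a + b).squeeze(1))


if __name__ == "__main__":
    heads = AffinityHeads(d_model=1024)
    h_a, h_b = torch.randn(2, 150, 1024), torch.randn(2, 200, 1024)
    print(heads.pooled_attention_addition(h_a, h_b).shape)  # torch.Size([2, 1])
```

Under this reading, HP and PAD differ from EC mainly in that they combine chain summaries symmetrically (by pooling or attention) rather than fixing an arbitrary chain order through concatenation, which is one plausible explanation for their reported advantage.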
