Beyond Simple Concatenation: Fairly Assessing PLM Architectures for Multi-Chain Protein-Protein Interactions Prediction
May 26, 2025
Authors: Hazem Alsamkary, Mohamed Elshaffei, Mohamed Soudy, Sara Ossman, Abdallah Amr, Nehal Adel Abdelsalam, Mohamed Elkerdawy, Ahmed Elnaggar
cs.AI
Abstract
Protein-protein interactions (PPIs) are fundamental to numerous cellular
processes, and their characterization is vital for understanding disease
mechanisms and guiding drug discovery. While protein language models (PLMs)
have demonstrated remarkable success in predicting protein structure and
function, their application to sequence-based PPI binding affinity prediction
remains relatively underexplored. This gap is often attributed to the scarcity
of high-quality, rigorously refined datasets and the reliance on simple
strategies for concatenating protein representations. In this work, we address
these limitations. First, we introduce a meticulously curated version of the
PPB-Affinity dataset comprising 8,207 unique protein-protein interaction
entries, produced by resolving annotation inconsistencies and duplicate entries
for multi-chain protein interactions. The dataset enforces a stringent sequence
identity threshold (≤30%) across the training, validation, and test sets,
yielding robust splits and minimizing data leakage.
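The identity-aware split can be pictured in a few lines. The Python below is a minimal sketch under our own assumptions, not the authors' pipeline: pairwise identities are assumed precomputed (e.g., via MMseqs2 or a pairwise aligner), and the helper names `greedy_cluster` and `split_clusters` are hypothetical.

```python
# Minimal sketch of identity-aware splitting: merge sequences into clusters
# whenever any pair exceeds the identity threshold, then assign whole clusters
# to train/val/test so similar sequences never cross split boundaries.
# `identity(a, b)` is an assumed callable returning fractional identity.

def greedy_cluster(seq_ids, identity, threshold=0.30):
    """Cluster so that cross-cluster pairs stay at or below `threshold`."""
    clusters = []
    for sid in seq_ids:
        # every existing cluster containing a too-similar member must merge
        linked = [c for c in clusters
                  if any(identity(sid, m) > threshold for m in c)]
        merged = [sid] + [m for c in linked for m in c]
        clusters = [c for c in clusters if c not in linked]
        clusters.append(merged)
    return clusters

def split_clusters(clusters, fractions=(0.8, 0.1, 0.1)):
    """Fill train/val/test with whole clusters, largest first."""
    total = sum(len(c) for c in clusters)
    splits = [[], [], []]
    budget = [f * total for f in fractions]
    for cluster in sorted(clusters, key=len, reverse=True):
        i = max(range(3), key=lambda k: budget[k] - len(splits[k]))
        splits[i].extend(cluster)
    return splits  # train_ids, val_ids, test_ids
```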
Second, we propose and systematically evaluate four architectures for adapting
PLMs to PPI binding affinity prediction: embeddings concatenation (EC),
sequences concatenation (SC), hierarchical pooling (HP), and pooled attention
addition (PAD).
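The abstract names the four heads without detailing their layers, so the PyTorch sketch below is one plausible reading of HP and PAD under our own assumptions (mean pooling per chain, a small MLP regressor, and multi-head attention for PAD); it is not the paper's implementation.

```python
import torch
import torch.nn as nn

class HierarchicalPooling(nn.Module):
    """HP sketch: pool residues -> per-chain vectors -> one complex vector."""
    def __init__(self, dim):
        super().__init__()
        self.regressor = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, chain_embeddings):  # list of [L_i, dim] tensors, one per chain
        chain_vecs = torch.stack([e.mean(dim=0) for e in chain_embeddings])  # [C, dim]
        return self.regressor(chain_vecs.mean(dim=0))  # pool chains, predict affinity

class PooledAttentionAddition(nn.Module):
    """PAD sketch: pooled chain vectors attend to each other, then a residual add."""
    def __init__(self, dim, heads=4):  # dim must be divisible by heads
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.regressor = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, chain_embeddings):
        chain_vecs = torch.stack(
            [e.mean(dim=0) for e in chain_embeddings]).unsqueeze(0)  # [1, C, dim]
        attended, _ = self.attn(chain_vecs, chain_vecs, chain_vecs)  # chains interact
        fused = (chain_vecs + attended).mean(dim=1).squeeze(0)       # add, then pool
        return self.regressor(fused)
```

Either head consumes per-chain residue embeddings, which is what lets the same comparison run unchanged across different PLM backbones.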
These architectures were assessed under two training regimes: full fine-tuning
and a lightweight approach that trains ConvBERT heads over frozen PLM features.
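The lightweight regime is easy to picture: freeze the PLM and train only a small head on its residue embeddings. In the hedged sketch below, the ESM2 checkpoint name is an arbitrary public example and a plain Conv1d head stands in for the ConvBERT head, whose configuration the abstract does not specify.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

name = "facebook/esm2_t12_35M_UR50D"  # illustrative checkpoint, not the paper's choice
plm = AutoModel.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)
for p in plm.parameters():
    p.requires_grad = False  # frozen PLM: gradients flow only through the head

head = nn.Sequential(  # small convolutional head over frozen features
    nn.Conv1d(plm.config.hidden_size, 256, kernel_size=7, padding=3),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(256, 1),
)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)

def embed(sequence):
    tokens = tokenizer(sequence, return_tensors="pt")
    with torch.no_grad():  # the encoder never updates
        hidden = plm(**tokens).last_hidden_state  # [1, L, dim]
    return hidden.transpose(1, 2)                 # [1, dim, L] for Conv1d

score = head(embed("MKTAYIAKQR"))  # toy single-chain call; real inputs are complexes
```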
Our comprehensive experiments across multiple leading PLMs (ProtT5, ESM2,
Ankh, Ankh2, and ESM3) show that the HP and PAD architectures consistently
outperform conventional concatenation methods, achieving up to a 12% increase
in Spearman correlation. These results highlight the necessity of sophisticated
architectural designs to fully exploit the capabilities of PLMs for nuanced PPI
binding affinity prediction.
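The reported metric is a rank correlation, so evaluation reduces to a single SciPy call; the values below are placeholders, not results from the paper.

```python
from scipy.stats import spearmanr

predicted = [-9.1, -7.4, -11.2, -8.0]  # illustrative model outputs
measured = [-9.5, -7.0, -10.8, -8.3]   # illustrative experimental affinities
rho, pvalue = spearmanr(predicted, measured)
print(f"Spearman rho = {rho:.3f} (p = {pvalue:.3g})")
```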