Beyond Simple Concatenation: Fairly Assessing PLM Architectures for Multi-Chain Protein-Protein Interactions Prediction

Abstract

Protein-protein interactions (PPIs) are fundamental to numerous cellular processes, and characterizing them is vital for understanding disease mechanisms and guiding drug discovery. While protein language models (PLMs) have demonstrated remarkable success in predicting protein structure and function, their application to sequence-based PPI binding affinity prediction remains relatively underexplored. This gap is often attributed to the scarcity of high-quality, rigorously curated datasets and to the reliance on simple strategies for concatenating protein representations. In this work, we address both limitations. First, we introduce a meticulously curated version of the PPB-Affinity dataset comprising 8,207 unique PPI entries, obtained by resolving annotation inconsistencies and duplicate entries for multi-chain protein interactions. The dataset applies a stringent sequence identity threshold of at most 30% to ensure robust splitting into training, validation, and test sets and to minimize data leakage. Second, we propose and systematically evaluate four architectures for adapting PLMs to PPI binding affinity prediction: embeddings concatenation (EC), sequences concatenation (SC), hierarchical pooling (HP), and pooled attention addition (PAD). Each architecture was assessed with two training methods: full fine-tuning, and a lightweight approach that applies ConvBERT heads over frozen PLM features. Comprehensive experiments across multiple leading PLMs (ProtT5, ESM2, Ankh, Ankh2, and ESM3) show that the HP and PAD architectures consistently outperform conventional concatenation methods, achieving up to a 12% increase in Spearman correlation. These results highlight the need for sophisticated architectural designs to fully exploit the capabilities of PLMs for nuanced PPI binding affinity prediction.
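The reported gains are measured in Spearman correlation, which compares the rank order of predicted and measured affinities rather than their raw values. The following is a minimal sketch of that metric in plain Python, not the evaluation code used in this work; the function names `average_ranks` and `spearman` and the sample affinity values are hypothetical and chosen only for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of items tied with the item at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(predicted, measured):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(predicted), average_ranks(measured)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical binding affinities, for illustration only (not from the dataset).
measured = [-12.1, -9.8, -7.5, -10.4, -8.0]
predicted = [-11.0, -9.0, -8.2, -10.9, -7.1]
rho = spearman(predicted, measured)
print(f"Spearman correlation: {rho:.3f}")
```

Because only ranks enter the computation, a model is rewarded for ordering complexes correctly by affinity even when its absolute predictions are offset or rescaled.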
