Learning Molecular Representation in a Cell
June 17, 2024
Authors: Gang Liu, Srijit Seal, John Arevalo, Zhenwen Liang, Anne E. Carpenter, Meng Jiang, Shantanu Singh
cs.AI
Abstract
Predicting drug efficacy and safety in vivo requires information on
biological responses (e.g., cell morphology and gene expression) to small
molecule perturbations. However, current molecular representation learning
methods do not provide a comprehensive view of cell states under these
perturbations and struggle to remove noise, hindering model generalization. We
introduce the Information Alignment (InfoAlign) approach to learn molecular
representations through the information bottleneck method in cells. We
integrate molecules and cellular response data as nodes into a context graph,
connecting them with weighted edges based on chemical, biological, and
computational criteria. For each molecule in a training batch, InfoAlign
optimizes the encoder's latent representation with a minimality objective to
discard redundant structural information. A sufficiency objective decodes the
representation to align with different feature spaces from the molecule's
neighborhood in the context graph. We demonstrate that the proposed sufficiency
objective for alignment is tighter than existing encoder-based contrastive
methods. Empirically, we validate representations from InfoAlign in two
downstream tasks: molecular property prediction against up to 19 baseline
methods across four datasets, plus zero-shot molecule-morphology matching.
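
To make the two objectives described in the abstract more concrete, the following is a minimal NumPy sketch of how a minimality term and an edge-weighted sufficiency term might be combined into one loss. It is not the authors' implementation. The abstract does not specify the functional forms, so the Gaussian latent with a standard-normal prior, the squared-error alignment loss, the trade-off weight `beta`, and all function names here are assumptions made for illustration.

```python
import numpy as np


def kl_to_standard_normal(mu, log_var):
    """Per-sample KL( N(mu, diag(exp(log_var))) || N(0, I) ).

    Assumed stand-in for the minimality objective: it penalizes
    information kept in the latent representation.
    """
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1)


def weighted_alignment_loss(decoded, targets, weights):
    """Edge-weighted mean squared error over a molecule's graph neighbors.

    Assumed stand-in for the sufficiency objective: `decoded` holds the
    decoder outputs, `targets` the neighbor features (e.g. cell morphology
    or gene expression vectors), and `weights` the context-graph edge weights.
    """
    per_neighbor = np.mean((decoded - targets) ** 2, axis=1)
    return np.sum(weights * per_neighbor) / np.sum(weights)


def toy_objective(mu, log_var, decoded, targets, weights, beta=0.1):
    """Sufficiency loss plus beta times the minimality penalty (hypothetical)."""
    minimality = np.mean(kl_to_standard_normal(mu, log_var))
    sufficiency = weighted_alignment_loss(decoded, targets, weights)
    return sufficiency + beta * minimality


if __name__ == "__main__":
    mu = np.array([[1.0, 0.0]])          # latent mean for one molecule
    log_var = np.zeros((1, 2))           # unit variance
    decoded = np.zeros((2, 2))           # decoder outputs for two neighbors
    targets = np.array([[1.0, 1.0], [2.0, 2.0]])
    weights = np.array([1.0, 3.0])       # edge weights to the two neighbors
    print(toy_objective(mu, log_var, decoded, targets, weights))
```

In this sketch, neighbors connected by heavier edges contribute more to the alignment term, while `beta` controls how strongly the latent is compressed; both choices are illustrative rather than taken from the paper.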