Retrieval-Augmented Generation for Predicting Cellular Responses to Gene Perturbation
March 7, 2026
Authors: Andrea Giuseppe Di Francesco, Andrea Rubbi, Pietro Liò
cs.AI
Abstract
Predicting how cells respond to genetic perturbations is fundamental to understanding gene function, disease mechanisms, and therapeutic development. While recent deep learning approaches have shown promise in modeling single-cell perturbation responses, they struggle to generalize across cell types and perturbation contexts due to limited contextual information during generation. We introduce PT-RAG (Perturbation-aware Two-stage Retrieval-Augmented Generation), a novel framework that extends Retrieval-Augmented Generation beyond traditional language-model applications to cellular biology. Unlike standard RAG systems designed for text retrieval with pre-trained LLMs, perturbation retrieval lacks established similarity metrics and requires learning what constitutes relevant context, making differentiable retrieval essential. PT-RAG addresses this through a two-stage pipeline: first, retrieving candidate perturbations K using GenePT embeddings, then adaptively refining the selection through Gumbel-Softmax discrete sampling conditioned on both the cell state and the input perturbation. This cell-type-aware differentiable retrieval enables end-to-end optimization of the retrieval objective jointly with generation. On the Replogle-Nadig single-gene perturbation dataset, we demonstrate that PT-RAG outperforms both STATE and vanilla RAG under identical experimental conditions, with the strongest gains in distributional similarity metrics (W_1, W_2). Notably, vanilla RAG's dramatic failure is itself a key finding: it demonstrates that differentiable, cell-type-aware retrieval is essential in this domain, and that naive retrieval can actively harm performance. Our results establish retrieval-augmented generation as a promising paradigm for modeling cellular responses to gene perturbation. The code to reproduce our experiments is available at https://github.com/difra100/PT-RAG_ICLR.
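The two-stage pipeline described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation (see the repository linked above for that); it only shows the general shape of the idea under several assumptions. The gene names, the 3-dimensional embeddings, the dot-product scorer, the per-candidate keep/drop formulation, and the function names `retrieve_candidates`, `gumbel_softmax`, and `refine_selection` are all hypothetical. Stage 1 ranks a bank of perturbation embeddings by cosine similarity to the query perturbation; stage 2 draws a Gumbel-Softmax sample for each candidate, with logits that depend on both the cell state and the input perturbation. Only the forward pass is shown: in an autodiff framework the relaxed probabilities would carry gradients back to a learned scorer, which is what makes the retrieval trainable end to end.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_candidates(query_emb, bank, k):
    """Stage 1: fixed (non-learned) top-k retrieval by cosine similarity."""
    ranked = sorted(bank, key=lambda g: cosine(query_emb, bank[g]), reverse=True)
    return ranked[:k]


def gumbel_softmax(logits, tau, rng):
    """Relaxed categorical sample: softmax((logits + Gumbel(0, 1) noise) / tau)."""
    noisy = []
    for logit in logits:
        # Clip the uniform draw away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def refine_selection(candidates, bank, query_emb, cell_state, tau, rng):
    """Stage 2: one sampled keep/drop decision per candidate."""
    kept = []
    for gene in candidates:
        emb = bank[gene]
        # Hypothetical scorer (assumption): a trained network would replace
        # this dot product of the candidate with query + cell state.
        keep_logit = sum(e * (q + c) for e, q, c in zip(emb, query_emb, cell_state))
        probs = gumbel_softmax([keep_logit, 0.0], tau, rng)  # [keep, drop]
        if probs[0] > probs[1]:  # hard decision taken in the forward pass
            kept.append(gene)
    return kept


# Toy stand-ins for gene embeddings and a cell-state vector (made up).
bank = {
    "GENE_A": [1.0, 0.0, 0.0],
    "GENE_B": [0.9, 0.1, 0.0],
    "GENE_C": [0.0, 1.0, 0.0],
    "GENE_D": [0.0, 0.0, 1.0],
}
query_emb = [1.0, 0.05, 0.0]
cell_state = [0.5, -0.5, 0.0]

rng = random.Random(0)
candidates = retrieve_candidates(query_emb, bank, k=3)
context = refine_selection(candidates, bank, query_emb, cell_state, tau=0.5, rng=rng)
print("stage 1 candidates:", candidates)
print("stage 2 context:", context)
```

Because the second stage is sampled, the retained context can differ between runs and between cell states even when the stage 1 candidates are identical, whereas a retrieval-only baseline would always pass the same top-k set to the generator.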