Retrieval-Augmented Generation for Predicting Cellular Responses to Gene Perturbation

Abstract

Predicting how cells respond to genetic perturbations is fundamental to understanding gene function, disease mechanisms, and therapeutic development. While recent deep learning approaches have shown promise in modeling single-cell perturbation responses, they struggle to generalize across cell types and perturbation contexts because limited contextual information is available during generation. We introduce PT-RAG (Perturbation-aware Two-stage Retrieval-Augmented Generation), a novel framework that extends Retrieval-Augmented Generation beyond traditional language-model applications to cellular biology. Unlike standard RAG systems designed for text retrieval with pre-trained LLMs, perturbation retrieval lacks established similarity metrics and requires learning what constitutes relevant context, making differentiable retrieval essential. PT-RAG addresses this through a two-stage pipeline: it first retrieves candidate perturbations K using GenePT embeddings, then adaptively refines the selection through Gumbel-Softmax discrete sampling conditioned on both the cell state and the input perturbation. This cell-type-aware differentiable retrieval enables end-to-end optimization of the retrieval objective jointly with generation. On the Replogle-Nadig single-gene perturbation dataset, we demonstrate that PT-RAG outperforms both STATE and vanilla RAG under identical experimental conditions, with the strongest gains in distributional similarity metrics (W_1, W_2). Notably, vanilla RAG's dramatic failure is itself a key finding: it demonstrates that differentiable, cell-type-aware retrieval is essential in this domain, and that naive retrieval can actively harm performance. Our results establish retrieval-augmented generation as a promising paradigm for modeling cellular responses to gene perturbation. The code to reproduce our experiments is available at https://github.com/difra100/PT-RAG_ICLR.
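To make the two-stage pipeline concrete, the following is a minimal sketch of its forward pass in pure Python. It is not the authors' implementation. The function names, the toy three-dimensional embeddings (stand-ins for GenePT vectors), the gene labels, and the relevance logits are all hypothetical. In the method described above, the logits would presumably come from a learned scorer conditioned on cell state and input perturbation, and the computation would run in an automatic-differentiation framework so that gradients can flow through the relaxed selection.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def retrieve_top_k(query, bank, k):
    """Stage 1: names of the k bank entries most similar to the query embedding."""
    ranked = sorted(bank, key=lambda name: cosine(query, bank[name]), reverse=True)
    return ranked[:k]


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Stage 2: relaxed one-hot sample over the candidates (weights sum to 1)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1);
        # U is clamped away from 0 and 1 to keep both logarithms finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embedding bank with made-up values and placeholder gene names.
bank = {
    "GENE_A": [1.0, 0.0, 0.0],
    "GENE_B": [0.9, 0.1, 0.0],
    "GENE_C": [0.0, 1.0, 0.0],
    "GENE_D": [0.0, 0.0, 1.0],
}
query = [1.0, 0.05, 0.0]

candidates = retrieve_top_k(query, bank, k=2)

# Hypothetical relevance scores, one per candidate (assumed, not learned here).
logits = [0.2, 1.5]
weights = gumbel_softmax(logits, tau=0.5, rng=random.Random(0))
```

Lower values of `tau` push the weights toward a one-hot choice, while higher values spread them more evenly across candidates. Because stage 1 uses a fixed similarity and only stage 2 is relaxed, this sketch leaves the candidate pool itself untrained, which is one plausible reading of the two-stage design.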