CellForge: Agentic Design of Virtual Cell Models

Abstract

Virtual cell modeling is an emerging frontier at the intersection of artificial intelligence and biology. It aims to quantitatively predict quantities such as responses to diverse perturbations. However, autonomously building computational models for virtual cells is challenging because of the complexity of biological systems, the heterogeneity of data modalities, and the need for domain-specific expertise across multiple disciplines. Here, we introduce CellForge, an agentic system that uses a multi-agent framework to transform the provided biological datasets and research objectives directly into optimized computational models for virtual cells. More specifically, given only raw single-cell multi-omics data and task descriptions as input, CellForge outputs both an optimized model architecture and executable code for training virtual cell models and running inference. The framework integrates three core modules: Task Analysis, which characterizes the provided dataset and retrieves relevant literature; Method Design, in which specialized agents collaboratively develop optimized modeling strategies; and Experiment Execution, which automates code generation. The agents in the Method Design module are divided into experts with differing perspectives and a central moderator, and they exchange solutions collaboratively until they reach a reasonable consensus. We demonstrate CellForge's capabilities in single-cell perturbation prediction, using six diverse datasets that encompass gene knockouts, drug treatments, and cytokine stimulations across multiple modalities. CellForge consistently outperforms task-specific state-of-the-art methods. Overall, CellForge demonstrates how iterative interaction between LLM agents with differing perspectives yields better solutions than addressing a modeling challenge directly. Our code is publicly available at https://github.com/gersteinlab/CellForge.
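
The expert-and-moderator exchange described in the abstract can be pictured with a minimal Python sketch. It is not taken from the CellForge repository. The names `Proposal`, `make_stub_expert`, and `moderate`, the confidence scores, the threshold, and the consensus rule (every expert backs the same plan with at least the threshold confidence) are all assumptions made for illustration. The stub experts are simple deterministic stand-ins for LLM agents.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Proposal:
    author: str
    plan: str
    confidence: float  # self-reported score in [0, 1]; an assumption of this sketch


# An expert maps (task description, previous round's proposals) to a new proposal.
Expert = Callable[[str, List[Proposal]], Proposal]


def make_stub_expert(name: str, plan: str, confidence: float) -> Expert:
    """Hypothetical stand-in for an LLM agent: it opens with its own plan,
    then adopts the most confident plan from the previous round."""
    def expert(task: str, previous: List[Proposal]) -> Proposal:
        if not previous:
            return Proposal(name, plan, confidence)
        leader = max(previous, key=lambda p: p.confidence)
        return Proposal(name, leader.plan, min(1.0, leader.confidence + 0.25))
    return expert


def moderate(task: str, experts: List[Expert], threshold: float = 0.75,
             max_rounds: int = 5) -> Tuple[Proposal, int, bool]:
    """Run discussion rounds until all experts back one plan with enough
    confidence, or until max_rounds is exhausted (illustrative rule only)."""
    if not experts or max_rounds < 1:
        raise ValueError("need at least one expert and one round")
    proposals: List[Proposal] = []
    reached = False
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        # Every expert sees the previous round's proposals, not this round's.
        proposals = [expert(task, proposals) for expert in experts]
        same_plan = len({p.plan for p in proposals}) == 1
        confident = min(p.confidence for p in proposals) >= threshold
        if same_plan and confident:
            reached = True
            break
    return max(proposals, key=lambda p: p.confidence), rounds, reached


# Placeholder plans and scores; they carry no biological meaning.
experts = [
    make_stub_expert("data expert", "plan A", 0.25),
    make_stub_expert("biology expert", "plan B", 0.5),
    make_stub_expert("training expert", "plan C", 0.375),
]
final, rounds, reached = moderate("predict perturbation response", experts)
print(final.plan, rounds, reached)  # prints: plan B 2 True
```

In this toy run the experts disagree in the first round and converge on the most confident plan in the second. A real system would replace the stubs with model calls and would likely use a richer consensus criterion than the one assumed here.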
