Towards Autonomous Mechanistic Reasoning in Virtual Cells

Abstract

Large language models (LLMs) have recently gained significant attention as a promising approach to accelerating scientific discovery. However, their application in open-ended scientific domains such as biology remains limited, primarily because their explanations are often neither factually grounded nor actionable. To address this, we introduce a structured explanation formalism for virtual cells that represents biological reasoning as mechanistic action graphs, enabling systematic verification and falsification. Building on this formalism, we propose VCR-Agent, a multi-agent framework that integrates biologically grounded knowledge retrieval with verifier-based filtering to generate and validate mechanistic reasoning autonomously. Using this framework, we release the VC-TRACES dataset, which consists of verified mechanistic explanations derived from the Tahoe-100M atlas. Empirically, we demonstrate that training with these explanations improves factual precision and provides a more effective supervision signal for downstream gene expression prediction. These results underscore the importance of reliable mechanistic reasoning for virtual cells, achieved through the synergy of multi-agent collaboration and rigorous verification.
