CausaLab: A Scalable Interactive Causal Discovery Environment for AI Scientists

Abstract

We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM-based agents. Unlike prior evaluations, CausaLab tests both whether an agent can solve a problem using causal evidence and whether its answer is grounded in a faithfully recovered causal mechanism. Each episode places the agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both the causal graph and the structural equations rather than recalling prior knowledge. Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but an all-edge F1 of only 0.471. Mixed observation-intervention strategies improve structural fidelity, while pure intervention remains difficult even for strong agents. We identify premature stopping as a major weakness and show that consistency verification mitigates it. CausaLab therefore separates predictive success from causal understanding and exposes the limits of current LLM agents as experimental causal reasoners.
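
To make the two quantities in the abstract concrete, the following is a minimal illustrative sketch, not CausaLab's actual implementation. It assumes a linear-Gaussian SCM over integer-indexed nodes and an F1 score computed over directed edges; the abstract does not specify the structural equations or the exact definition of "all-edge F1". The names `sample_linear_scm`, `simulate`, and `edge_f1` are hypothetical.

```python
import random


def sample_linear_scm(n_nodes, edge_prob, rng):
    """Sample a random DAG with linear weights (assumed functional form).

    Edges only run from a lower index to a higher index, so the graph
    is acyclic by construction. Returns {(parent, child): weight}.
    """
    weights = {}
    for child in range(n_nodes):
        for parent in range(child):
            if rng.random() < edge_prob:
                weights[(parent, child)] = rng.uniform(-2.0, 2.0)
    return weights


def simulate(weights, n_nodes, rng, do=None, noise_scale=0.1):
    """Draw one sample; `do` maps node -> fixed value (hard intervention)."""
    do = do or {}
    values = []
    for node in range(n_nodes):
        if node in do:
            # An intervened node ignores its parents and its noise term.
            values.append(do[node])
        else:
            total = sum(w * values[p] for (p, c), w in weights.items() if c == node)
            values.append(total + rng.gauss(0.0, noise_scale))
    return values


def edge_f1(true_edges, predicted_edges):
    """F1 over directed edges; a reversed edge counts as an error."""
    true_set, pred_set = set(true_edges), set(predicted_edges)
    if not true_set and not pred_set:
        return 1.0
    tp = len(true_set & pred_set)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_set)
    recall = tp / len(true_set)
    return 2 * precision * recall / (precision + recall)


rng = random.Random(0)
scm = sample_linear_scm(6, 0.5, rng)          # hidden mechanism (6 nodes)
sample = simulate(scm, 6, rng, do={0: 1.0})   # one interventional record

# One correct edge, one reversed edge, one missed edge.
score = edge_f1([(0, 1), (1, 2), (2, 3)], [(0, 1), (2, 1)])
print(round(score, 3))  # 0.4
```

Under these assumptions, an agent could predict a downstream value well from a few strong correlations while still reporting reversed or missing edges, which is the kind of gap between task accuracy and structural fidelity that the abstract reports.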