CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists
May 28, 2026
Authors: Junlin Yang, Dylan Zhang, Xiangchen Song, Qirun Dai, Xiao Liu, Yuen Chen, Aniket Vashishtha, Jing Shi, Chenhao Tan, Hao Peng
cs.AI
Abstract
We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab tests both whether an agent can solve a problem using causal evidence and whether its answer is grounded in a faithfully recovered causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both the causal graph and the structural equations rather than recalling prior knowledge.
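To make the episode setup concrete, the following is a minimal sketch of a randomly sampled SCM that supports both observation and intervention. It is not CausaLab's generator: the linear-Gaussian form, the edge probability, the weight range, and the names `sample_linear_scm` and `simulate` are all assumptions chosen for illustration.

```python
import random

def sample_linear_scm(n_nodes, edge_prob, rng):
    """Sample a random DAG over nodes 0..n_nodes-1 with linear edge weights.

    Assumption: edges only run from lower to higher index, which
    guarantees acyclicity. Returns {child: {parent: weight}}.
    """
    parents = {j: {} for j in range(n_nodes)}
    for j in range(n_nodes):
        for i in range(j):
            if rng.random() < edge_prob:
                parents[j][i] = rng.uniform(-2.0, 2.0)
    return parents

def simulate(parents, rng, do=None, noise_scale=1.0):
    """Draw one sample; `do` maps node -> clamped value (a hard intervention)."""
    do = do or {}
    values = {}
    for j in sorted(parents):  # index order is a topological order here
        if j in do:
            # An intervention overrides the structural equation of node j.
            values[j] = do[j]
        else:
            noise = rng.gauss(0.0, noise_scale)
            values[j] = sum(w * values[i] for i, w in parents[j].items()) + noise
    return values

rng = random.Random(0)
scm = sample_linear_scm(n_nodes=6, edge_prob=0.4, rng=rng)
observational = [simulate(scm, rng) for _ in range(5)]
interventional = [simulate(scm, rng, do={0: 3.0}) for _ in range(5)]
```

In this sketch, observational samples leave every structural equation intact, while `do={0: 3.0}` clamps node 0 and lets the change propagate to its descendants. An agent that only fits correlations in `observational` could predict well without identifying which edges actually exist.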
Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but an all-edge F1 score of only 0.471. Mixed observation-intervention strategies improve structural fidelity, while pure intervention remains difficult even for strong agents. We identify premature stopping as a major weakness and show that consistency verification mitigates it. CausaLab therefore separates predictive success from causal understanding and exposes the limits of current LLM agents as experimental causal reasoners.
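The gap between task accuracy and structural fidelity is easier to read with the edge metric written out. The sketch below computes a standard F1 score over directed edge sets. The abstract does not define "all-edge F1" precisely, so treating it as plain directed-edge F1 is an assumption, and `edge_f1` and the example graphs are hypothetical.

```python
def edge_f1(true_edges, predicted_edges):
    """F1 over directed edges, each given as a (parent, child) pair.

    Assumption: a reversed edge counts as both a false positive
    and a false negative.
    """
    true_set, pred_set = set(true_edges), set(predicted_edges)
    if not true_set and not pred_set:
        return 1.0  # both graphs empty: treat as perfect agreement
    tp = len(true_set & pred_set)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_set)
    recall = tp / len(true_set)
    return 2 * precision * recall / (precision + recall)

# Hypothetical ground-truth chain and an agent's reconstruction of it.
true_graph = [(0, 1), (1, 2), (2, 3)]
agent_graph = [(0, 1), (2, 1), (2, 3), (0, 3)]  # one reversed, one spurious
score = edge_f1(true_graph, agent_graph)  # precision 2/4, recall 2/3
```

Here the agent recovers two of three true edges but also proposes two wrong ones, giving an F1 of 4/7. A reconstruction like this could still support accurate predictions for the target variable, which is the kind of mismatch the reported numbers point to.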