CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists
May 28, 2026
Authors: Junlin Yang, Dylan Zhang, Xiangchen Song, Qirun Dai, Xiao Liu, Yuen Chen, Aniket Vashishtha, Jing Shi, Chenhao Tan, Hao Peng
cs.AI
Abstract
We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab assesses both whether an agent can solve a problem using causal evidence and whether its answer is grounded in a faithfully recovered causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both the causal graph and the structural equations rather than recalling prior knowledge.
Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but an all-edge F1 score of only 0.471. Mixed observation-intervention strategies improve structural fidelity, while pure intervention remains difficult even for strong agents. We identify premature stopping as a major weakness and show that consistency verification mitigates it. CausaLab therefore separates predictive success from causal understanding and exposes current LLM agents' limits as experimental causal reasoners.
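To make the two quantities in the abstract concrete, the following is a minimal sketch of a randomly sampled SCM, a hard intervention on one node, and an F1 score over directed edges. It is an illustration only: the abstract does not specify CausaLab's SCM family, noise model, or the exact definition of "all-edge F1", so the linear-Gaussian form and the names `sample_linear_scm`, `simulate`, and `edge_f1` are assumptions, not the paper's implementation.

```python
import random


def sample_linear_scm(n_nodes, edge_prob, rng):
    """Sample a random DAG with linear edge weights (assumed model).

    Nodes 0..n_nodes-1 are in topological order, so an edge
    (parent, child) always has parent < child.
    """
    weights = {}
    for child in range(n_nodes):
        for parent in range(child):
            if rng.random() < edge_prob:
                sign = rng.choice([-1.0, 1.0])
                weights[(parent, child)] = sign * rng.uniform(0.5, 2.0)
    return weights


def simulate(weights, n_nodes, rng, do=None):
    """Draw one sample; `do` maps node -> forced value (hard intervention)."""
    do = do or {}
    values = []
    for node in range(n_nodes):
        if node in do:
            # An intervened node ignores its parents and its own noise.
            values.append(do[node])
            continue
        total = rng.gauss(0.0, 1.0)  # additive Gaussian noise (assumed)
        for parent in range(node):
            total += weights.get((parent, node), 0.0) * values[parent]
        values.append(total)
    return values


def edge_f1(true_edges, predicted_edges):
    """F1 over directed edges; returns 0.0 when no edge is correct."""
    true_edges, predicted_edges = set(true_edges), set(predicted_edges)
    true_positives = len(true_edges & predicted_edges)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_edges)
    recall = true_positives / len(true_edges)
    return 2 * precision * recall / (precision + recall)


rng = random.Random(0)
weights = sample_linear_scm(n_nodes=6, edge_prob=0.4, rng=rng)
observed = simulate(weights, 6, rng)                 # observational sample
intervened = simulate(weights, 6, rng, do={0: 3.0})  # sample under do(X0 = 3)

# Two of three predicted edges are correct, so precision = recall = 2/3.
score = edge_f1({(0, 1), (1, 2), (2, 3)}, {(0, 1), (1, 2), (0, 3)})
```

Under these assumptions, an agent can predict a downstream value well from correlations alone while still proposing wrong edges, which is the kind of gap between task accuracy and edge F1 that the abstract reports.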