CausaLab: A Scalable Environment for Interactive Causal Discovery for AI Scientists

Abstract

We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab assesses both whether an agent can solve a problem using causal evidence and whether its answer is grounded in a faithfully recovered causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both the causal graph and the structural equations rather than recalling prior knowledge. Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but an all-edge F1 of only 0.471. Strategies that mix observation and intervention improve structural fidelity, while pure intervention remains difficult even for strong agents. We identify premature stopping as a major weakness and show that consistency verification mitigates it. CausaLab therefore separates predictive success from causal understanding and exposes the limits of current LLM agents as experimental causal reasoners.
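To make the gap between task accuracy and structural fidelity concrete, the following minimal sketch computes an edge-level F1 between a true and a predicted causal graph. It rests on assumptions: the abstract does not define "all-edge F1", so the sketch treats it as the harmonic mean of precision and recall over directed edges, and the function name `edge_f1` and the toy graphs are hypothetical illustrations, not part of CausaLab.

```python
def edge_f1(true_edges, pred_edges):
    """Edge-level F1 over directed edges (an assumed definition, see lead-in).

    Each edge is a (parent, child) tuple, so a reversed edge counts as wrong.
    """
    true_set, pred_set = set(true_edges), set(pred_edges)
    true_positives = len(true_set & pred_set)
    if true_positives == 0:
        # Also covers empty inputs, avoiding division by zero.
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(true_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical 4-node example: the prediction recovers two of four true
# edges and adds one spurious (reversed) edge.
true_graph = [("A", "B"), ("B", "C"), ("A", "D"), ("D", "C")]
pred_graph = [("A", "B"), ("B", "C"), ("C", "D")]

score = edge_f1(true_graph, pred_graph)
print(f"edge F1 = {score:.3f}")
```

In this toy case the predicted graph still contains a directed path from A to C, so an agent relying on it might predict a downstream quantity reasonably well while missing half of the true edges, which is the kind of mismatch the abstract reports.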