CausaLab: A Scalable Environment for Interactive Causal Discovery for AI Scientists

Abstract

We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab assesses both whether an agent can solve a problem using causal evidence and whether its answer is grounded in a faithfully recovered causal mechanism. Each episode places the agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both the causal graph and the structural equations rather than recalling prior knowledge. Experiments reveal a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but an all-edge F_1 score of only 0.471. Mixed observation-intervention strategies improve structural fidelity, while pure intervention remains difficult even for strong agents. We identify premature stopping as a major weakness and show that consistency verification mitigates it. CausaLab therefore separates predictive success from causal understanding and exposes the limits of current LLM agents as experimental causal reasoners.
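
To make the setup concrete, the following is a minimal Python sketch of the ingredients the abstract names: a randomly sampled SCM, a hard intervention on one variable, and an edge-level F_1 score between a true and a recovered graph. Everything here is an illustrative assumption rather than CausaLab's actual implementation: the function names (`sample_linear_scm`, `simulate`, `edge_f1`), the linear-Gaussian form of the structural equations, and the definition of edge F_1 as the harmonic mean of precision and recall over directed edges are all hypothetical choices made for this sketch.

```python
import random


def sample_linear_scm(n_nodes, edge_prob, rng):
    """Sample a random linear SCM (hypothetical generator, not CausaLab's).

    Edges only run from a lower to a higher node index, which guarantees
    a DAG. Returns {child: {parent: weight}}.
    """
    parents = {j: {} for j in range(n_nodes)}
    for j in range(n_nodes):
        for i in range(j):
            if rng.random() < edge_prob:
                parents[j][i] = rng.choice([-1, 1]) * rng.uniform(0.5, 2.0)
    return parents


def simulate(parents, rng, do=None, noise_std=0.1):
    """Draw one sample; `do` maps node -> forced value (hard intervention)."""
    do = do or {}
    values = {}
    for j in sorted(parents):  # index order is a topological order here
        if j in do:
            values[j] = do[j]  # the intervention overrides the node's equation
        else:
            signal = sum(w * values[i] for i, w in parents[j].items())
            values[j] = signal + rng.gauss(0.0, noise_std)
    return values


def edges_of(parents):
    """Directed edge set {(parent, child)} of an SCM."""
    return {(i, j) for j, ps in parents.items() for i in ps}


def edge_f1(true_edges, predicted_edges):
    """F_1 over directed edges (assumed definition; returns 0.0 if no overlap)."""
    true_edges, predicted_edges = set(true_edges), set(predicted_edges)
    tp = len(true_edges & predicted_edges)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted_edges)
    recall = tp / len(true_edges)
    return 2 * precision * recall / (precision + recall)


rng = random.Random(0)
scm = sample_linear_scm(6, 0.4, rng)               # 6-node setting
observation = simulate(scm, rng)                   # passive measurement
intervention = simulate(scm, rng, do={0: 3.0})     # force node 0 to 3.0

# One correct edge and one reversed edge: precision = recall = 0.5.
score = edge_f1({(0, 1), (1, 2)}, {(0, 1), (2, 1)})  # 0.5
```

The reversed edge in the last line illustrates the gap the abstract describes: a graph with a wrongly oriented edge can still fit observational data, and may support accurate predictions, while scoring poorly on structural recovery.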