Are Text-to-Image Models Inductivist Turkeys? A Counterfactual Benchmark for Causal Reasoning

Abstract

Text-to-image (T2I) generation models have made remarkable progress in producing visually realistic images from natural language prompts. Yet it remains unclear whether this success reflects genuine causal understanding or sophisticated pattern matching over visual-textual correlations. Inspired by Russell's inductivist turkey, we introduce Counterfactual-World (CF-World), a counterfactual benchmark that tests whether T2I models can generate images under rules that systematically contradict real-world priors. CF-World organizes each scenario into three progressive levels: factual generation under ordinary world knowledge, explicit counterfactual generation with direct visual instructions, and implicit counterfactual generation that requires causal deduction from the altered rules. We evaluate both open-source and closed-source T2I models using CF-Eval, an evaluator based on a vision-language model (VLM). We also introduce two metrics: Prior Resistance Rate (PRR), which measures a model's ability to overcome entrenched real-world priors, and Reasoning Retention Rate (RRR), which assesses whether a model can sustain reasoning-dependent counterfactual generation without explicit visual cues. Experiments show that all models degrade sharply from the factual to the counterfactual settings. Further analyses suggest that these failures arise because current T2I models encode world knowledge and visual appearance as tightly coupled patterns. Consequently, their heavy reliance on frequent visual co-occurrences in the training data causes them to fall back on familiar commonsense priors when asked to render counterfactual worlds.
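The abstract describes PRR and RRR only qualitatively and gives no formulas. As a rough illustration of how such metrics could relate the three levels, the sketch below assumes (hypothetically) that each level yields an accuracy in [0, 1], that PRR is the ratio of explicit-counterfactual to factual accuracy, and that RRR is the ratio of implicit- to explicit-counterfactual accuracy. The names `LevelScores`, `prior_resistance_rate`, and `reasoning_retention_rate` are invented for this example, and the actual definitions may differ.

```python
from dataclasses import dataclass


@dataclass
class LevelScores:
    """Per-level accuracy in [0, 1] for one model (hypothetical container)."""
    factual: float        # level 1: ordinary world knowledge
    explicit_cf: float    # level 2: counterfactual with direct visual instructions
    implicit_cf: float    # level 3: counterfactual requiring causal deduction


def prior_resistance_rate(s: LevelScores) -> float:
    # ASSUMPTION: PRR as the share of factual performance that survives
    # when the prompt explicitly contradicts a real-world prior.
    return s.explicit_cf / s.factual if s.factual > 0 else 0.0


def reasoning_retention_rate(s: LevelScores) -> float:
    # ASSUMPTION: RRR as the share of explicit-counterfactual performance
    # that survives once the explicit visual cue is removed.
    return s.implicit_cf / s.explicit_cf if s.explicit_cf > 0 else 0.0


# Made-up numbers for illustration only; these are not results from the paper.
example = LevelScores(factual=0.8, explicit_cf=0.4, implicit_cf=0.1)
prr = prior_resistance_rate(example)
rrr = reasoning_retention_rate(example)
```

Under these assumed definitions, both values lie in [0, 1] whenever accuracy does not increase from one level to the next, and a lower value indicates a sharper drop between adjacent levels.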