Are Text-to-Image Models Inductivist Turkeys? A Counterfactual Benchmark for Evaluating Causal Reasoning

Abstract

Text-to-image (T2I) generation models have achieved remarkable progress in producing visually realistic images from natural language prompts. Yet it remains unclear whether their success reflects genuine causal understanding or sophisticated pattern matching over visual-textual correlations. Inspired by Russell's inductivist turkey, we introduce Counterfactual-World (CF-World), a counterfactual benchmark designed to investigate whether text-to-image models can generate images under rules that systematically contradict real-world priors. CF-World organizes each scenario into three progressive levels: factual generation under ordinary world knowledge, explicit counterfactual generation with direct visual instructions, and implicit counterfactual generation requiring causal deduction from altered rules. We evaluate both open-source and closed-source T2I models using a Vision Language Model (VLM)-based evaluator (CF-Eval). Furthermore, we introduce two metrics: Prior Resistance Rate (PRR), which measures a model's ability to overcome entrenched real-world priors, and Reasoning Retention Rate (RRR), which assesses whether models can maintain reasoning-dependent counterfactual generation without explicit visual cues. Experiments show that all models exhibit sharp degradation from factual to counterfactual settings. Further analyses suggest that these failures arise because current T2I models encode world knowledge and visual appearances as tightly coupled patterns. Consequently, their heavy reliance on frequent visual co-occurrences within the training data forces them to default to familiar commonsense priors when tasked with rendering counterfactual worlds.