Are Text-to-Image Models "Inductivist Turkeys"? A Counterfactual Benchmark for Causal Reasoning

Abstract

Text-to-image (T2I) generation models have made remarkable progress in producing visually realistic images from natural language prompts. Yet it remains unclear whether their success reflects genuine causal understanding or sophisticated pattern matching over visual-textual correlations. Inspired by Russell's inductivist turkey, we introduce Counterfactual-World (CF-World), a counterfactual benchmark designed to test whether T2I models can generate images under rules that systematically contradict real-world priors. CF-World organizes each scenario into three progressive levels: factual generation under ordinary world knowledge, explicit counterfactual generation with direct visual instructions, and implicit counterfactual generation that requires deducing the visual consequences of an altered rule. We evaluate both open-source and closed-source T2I models using CF-Eval, an evaluator based on a vision-language model (VLM). We also introduce two metrics: Prior Resistance Rate (PRR), which measures a model's ability to overcome entrenched real-world priors, and Reasoning Retention Rate (RRR), which assesses whether a model can sustain reasoning-dependent counterfactual generation without explicit visual cues. Experiments show that all models degrade sharply when moving from factual to counterfactual settings. Further analysis suggests that these failures arise because current T2I models encode world knowledge and visual appearance as tightly coupled patterns. As a result, their heavy reliance on visual co-occurrences that are frequent in the training data leads them to fall back on familiar commonsense priors when asked to render counterfactual worlds.
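
The abstract names PRR and RRR but does not give their formulas. The sketch below is one hypothetical reading, included only to make the three-level structure concrete: it assumes each scenario receives an evaluator score in [0, 1] at every level, and treats PRR as the explicit-counterfactual score relative to the factual score and RRR as the implicit-counterfactual score relative to the explicit one. The class name, field names, and both ratio definitions are assumptions, not the definitions used by the benchmark.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ScenarioScores:
    """Hypothetical per-scenario evaluator scores, each assumed to lie in [0, 1]."""
    factual: float   # level 1: generation under ordinary world knowledge
    explicit: float  # level 2: counterfactual with direct visual instructions
    implicit: float  # level 3: counterfactual requiring causal deduction


def _ratio(numerator: float, denominator: float) -> float:
    # Return 0.0 when the reference level scored nothing, to avoid dividing by zero.
    return numerator / denominator if denominator > 0 else 0.0


def prior_resistance_rate(scores: Sequence[ScenarioScores]) -> float:
    """ASSUMED reading of PRR: total explicit score divided by total factual score."""
    factual = sum(s.factual for s in scores)
    explicit = sum(s.explicit for s in scores)
    return _ratio(explicit, factual)


def reasoning_retention_rate(scores: Sequence[ScenarioScores]) -> float:
    """ASSUMED reading of RRR: total implicit score divided by total explicit score."""
    explicit = sum(s.explicit for s in scores)
    implicit = sum(s.implicit for s in scores)
    return _ratio(implicit, explicit)


if __name__ == "__main__":
    # Made-up illustrative scores, not results from the benchmark.
    demo = [ScenarioScores(1.0, 0.5, 0.25), ScenarioScores(1.0, 0.5, 0.25)]
    print(prior_resistance_rate(demo), reasoning_retention_rate(demo))
```

Under this reading, a low PRR would indicate that a model falls back on real-world priors even when told exactly what to draw, while a low RRR would indicate that it follows explicit instructions but fails once the visual consequence has to be inferred from the altered rule.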