CauSight: Learning to Supersense for Visual Causal Discovery
December 1, 2025
Authors: Yize Zhang, Meiqi Chen, Sirui Chen, Bo Peng, Yanxi Zhang, Tianyu Li, Chaochao Lu
cs.AI
Abstract
Causal thinking enables humans to understand not just what is seen, but why it happens. To replicate this capability in modern AI systems, we introduce the task of visual causal discovery, which requires models to infer cause-and-effect relations among visual entities across diverse scenarios instead of merely perceiving their presence. To this end, we first construct the Visual Causal Graph dataset (VCG-32K), a large-scale collection of over 32,000 images annotated with entity-level causal graphs, and further develop CauSight, a novel vision-language model that performs visual causal discovery through causally aware reasoning. Our training recipe integrates three components: (1) training data curation from VCG-32K, (2) Tree-of-Causal-Thought (ToCT) for synthesizing reasoning trajectories, and (3) reinforcement learning with a designed causal reward to refine the reasoning policy. Experiments show that CauSight outperforms GPT-4.1 on visual causal discovery, achieving over a threefold performance boost (a 21% absolute gain). Our code, model, and dataset are fully open-sourced on the project page: https://github.com/OpenCausaLab/CauSight.
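To make the notion of an entity-level causal graph concrete, here is a minimal illustrative sketch of how such an annotation could be represented as a directed graph over visual entities. The class name, entity labels, and edge semantics below are assumptions for illustration only, not the actual VCG-32K schema.

```python
# Hypothetical sketch: an entity-level causal graph (as annotated in a
# dataset like VCG-32K) modeled as a directed graph, where an edge
# (cause, effect) means "cause causally influences effect".
from collections import defaultdict


class CausalGraph:
    """Directed graph over visual entities in one image."""

    def __init__(self):
        # Maps each entity to the set of entities it directly causes.
        self.edges = defaultdict(set)

    def add_cause(self, cause: str, effect: str) -> None:
        """Record a direct causal edge cause -> effect."""
        self.edges[cause].add(effect)

    def effects_of(self, entity: str) -> set:
        """All entities transitively reachable from `entity` via causal edges."""
        seen, stack = set(), [entity]
        while stack:
            node = stack.pop()
            for nxt in self.edges[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen


# Example annotation for a street scene (entity labels are illustrative):
# rain causes a wet road, which in turn causes a skidding car.
g = CausalGraph()
g.add_cause("rain", "wet road")
g.add_cause("wet road", "skidding car")
```

A model performing visual causal discovery would, in effect, predict such a graph from the image alone, so that transitive queries like "what does rain eventually cause here?" become answerable.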