Causal-JEPA: Learning World Models through Object-Level Latent Interventions

Abstract

World models require robust relational understanding to support prediction, reasoning, and control. Object-centric representations provide a useful abstraction, but on their own they are not sufficient to capture interaction-dependent dynamics. We therefore propose C-JEPA, a simple and flexible object-centric world model that extends masked joint embedding prediction from image patches to object-centric representations. C-JEPA applies object-level masking, which requires a masked object's state to be inferred from the other objects. This induces latent interventions with counterfactual-like effects and rules out shortcut solutions, making interaction reasoning essential. Empirically, C-JEPA yields consistent gains in visual question answering, with an absolute improvement of about 20% in counterfactual reasoning over the same architecture without object-level masking. On agent control tasks, C-JEPA uses only 1% of the total latent input features required by patch-based world models, which makes planning substantially more efficient while achieving comparable performance. Finally, we provide a formal analysis showing that object-level masking induces a causal inductive bias via latent interventions. Our code is available at https://github.com/galilai-group/cjepa.
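To make the object-level masking idea concrete, the following is a minimal NumPy sketch and not the released C-JEPA implementation. It assumes object latents are stored as a `(num_objects, dim)` array, that a masked object is replaced by a shared mask token, and that the loss is a mean squared error over masked objects only. The function names, the zero mask token, and the mean-of-visible-slots `toy_predictor` are all hypothetical stand-ins chosen for illustration.

```python
import numpy as np


def mask_object_slots(slots, masked_idx, mask_token):
    """Replace whole object slots with a shared mask token.

    slots: (num_objects, dim) array of object latents (assumed layout).
    masked_idx: indices of the objects hidden from the predictor.
    Returns the masked copy and a boolean mask over objects.
    """
    masked = slots.copy()
    masked[masked_idx] = mask_token
    is_masked = np.zeros(slots.shape[0], dtype=bool)
    is_masked[masked_idx] = True
    return masked, is_masked


def toy_predictor(masked_slots, is_masked):
    """Hypothetical stand-in for a learned predictor.

    Fills each masked slot with the mean of the visible slots, so the
    estimate for a hidden object depends only on the other objects.
    """
    out = masked_slots.copy()
    out[is_masked] = masked_slots[~is_masked].mean(axis=0)
    return out


def masked_prediction_loss(predicted, target, is_masked):
    """Mean squared error computed over the masked objects only."""
    diff = predicted[is_masked] - target[is_masked]
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)
slots = rng.normal(size=(4, 8))   # 4 objects, 8-dimensional latents
mask_token = np.zeros(8)          # assumed placeholder; often a learned vector

masked, is_masked = mask_object_slots(slots, [2], mask_token)
pred = toy_predictor(masked, is_masked)
loss = masked_prediction_loss(pred, slots, is_masked)
```

Because the hidden object's own latent never reaches the predictor, any accurate estimate of it has to come from the remaining objects, which is the property the abstract describes as preventing shortcut solutions.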