Causal-JEPA: Learning World Models through Object-Level Latent Interventions
February 11, 2026
Authors: Heejeong Nam, Quentin Le Lidec, Lucas Maes, Yann LeCun, Randall Balestriero
cs.AI
Abstract
World models require robust relational understanding to support prediction, reasoning, and control. While object-centric representations provide a useful abstraction, they are not sufficient to capture interaction-dependent dynamics. We therefore propose C-JEPA, a simple and flexible object-centric world model that extends masked joint embedding prediction from image patches to object-centric representations. By applying object-level masking that requires an object's state to be inferred from other objects, C-JEPA induces latent interventions with counterfactual-like effects and prevents shortcut solutions, making interaction reasoning essential. Empirically, C-JEPA leads to consistent gains in visual question answering, with an absolute improvement of about 20% in counterfactual reasoning compared to the same architecture without object-level masking. On agent control tasks, C-JEPA enables substantially more efficient planning by using only 1% of the total latent input features required by patch-based world models, while achieving comparable performance. Finally, we provide a formal analysis demonstrating that object-level masking induces a causal inductive bias via latent interventions. Our code is available at https://github.com/galilai-group/cjepa.
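To make the idea of object-level masking concrete, the following is a minimal sketch and not the authors' implementation. It assumes object latents are stored as a time x object x feature structure, that one object's latents are replaced by a fixed mask token, and that the first time step stays visible. The names `mask_object`, `mask_token`, and `keep_steps` are hypothetical and do not come from the paper or its repository.

```python
import random


def mask_object(slots, obj_idx, mask_token, keep_steps=1):
    """Return a masked copy of `slots` plus the hidden latents as targets.

    slots: list over time steps; each step is a list over objects;
           each object is a list of floats (its latent vector).
    keep_steps: number of initial steps left visible for the masked object
                (an assumption of this sketch, not taken from the paper).
    """
    # Copy every latent vector so the input is not modified in place.
    masked = [[list(vec) for vec in frame] for frame in slots]
    targets = {}
    for t, frame in enumerate(masked):
        if t < keep_steps:
            continue
        # The original latent is what a predictor would have to recover.
        targets[t] = list(slots[t][obj_idx])
        # Hide the object; all other objects stay visible as context.
        frame[obj_idx] = list(mask_token)
    return masked, targets


# Toy data: 4 time steps, 3 objects, 2 features per object.
T, N, D = 4, 3, 2
rng = random.Random(0)
slots = [[[rng.random() for _ in range(D)] for _ in range(N)] for _ in range(T)]
mask_token = [0.0] * D

masked, targets = mask_object(slots, obj_idx=1, mask_token=mask_token)
```

In this sketch a predictor would receive `masked` and be trained to match `targets`, so the hidden object's state can only be inferred from the other objects and from its own visible first step.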