Latent Particle World Models: Self-supervised Object-centric Stochastic Dynamics Modeling

Abstract

We introduce the Latent Particle World Model (LPWM), a self-supervised, object-centric world model that scales to real-world multi-object datasets and is applicable to decision-making. LPWM autonomously discovers keypoints, bounding boxes, and object masks directly from video data, which lets it learn rich scene decompositions without supervision. The architecture is trained end-to-end purely from videos and supports flexible conditioning on actions, language, and image goals. LPWM models stochastic particle dynamics via a novel latent action module and achieves state-of-the-art results on diverse real-world and synthetic datasets. Beyond stochastic video modeling, LPWM is readily applicable to decision-making, including goal-conditioned imitation learning, as we demonstrate in the paper. Code, data, pre-trained models, and video rollouts are available at https://taldatech.github.io/lpwm-web