
Inferring Compositional 4D Scenes without Ever Seeing One

December 4, 2025
Authors: Ahmet Berke Gokmen, Ajad Chhatkuli, Luc Van Gool, Danda Pani Paudel
cs.AI

Abstract

Scenes in the real world are often composed of several static and dynamic objects. Capturing their 4-dimensional structure, composition, and spatio-temporal configuration in the wild, though extremely interesting, is equally hard. Therefore, existing works often focus on one object at a time, while relying on a category-specific parametric shape model for dynamic objects. This can lead to inconsistent scene configurations, in addition to being limited to the modeled object categories. We propose COM4D (Compositional 4D), a method that consistently and jointly predicts the structure and spatio-temporal configuration of 4D/3D objects using only static multi-object or dynamic single-object supervision. We achieve this through carefully designed training of spatial and temporal attention on 2D video input. The training is disentangled into learning from object compositions on the one hand, and single-object dynamics throughout the video on the other, thus completely avoiding reliance on 4D compositional training data. At inference time, our proposed attention mixing mechanism combines these independently learned attentions, without requiring any 4D composition examples. By alternating between spatial and temporal reasoning, COM4D reconstructs complete and persistent 4D scenes with multiple interacting objects directly from monocular videos. Furthermore, despite being purely data-driven, COM4D achieves state-of-the-art results on the existing, separate problems of 4D object reconstruction and compositional 3D reconstruction.
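
To make the alternating spatial/temporal reasoning and the inference-time attention mixing described above concrete, here is a minimal sketch in PyTorch. This is not the authors' implementation: the module layout, token shapes (frames × objects × feature dim), and the mixing coefficient `alpha` are assumptions made purely for illustration of the general idea.

```python
# Illustrative sketch only: spatial attention (across objects in a frame) and
# temporal attention (across frames per object) are trained separately, then
# their outputs are blended at inference time. All names/shapes are assumptions.
import torch
import torch.nn as nn

class SpatioTemporalMixer(nn.Module):
    """Alternates spatial and temporal attention over video object tokens,
    then mixes the two independently learned attention outputs."""

    def __init__(self, dim: int, num_heads: int = 4, alpha: float = 0.5):
        super().__init__()
        # Hypothetically trained on different data: spatial attention on static
        # multi-object scenes, temporal attention on dynamic single objects.
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.alpha = alpha  # hypothetical mixing coefficient

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, time, objects, dim)
        b, t, o, d = tokens.shape

        # Spatial pass: attend across object tokens within each frame.
        spatial_in = tokens.reshape(b * t, o, d)
        spatial_out, _ = self.spatial_attn(spatial_in, spatial_in, spatial_in)
        spatial_out = spatial_out.reshape(b, t, o, d)

        # Temporal pass: attend across frames for each object track.
        temporal_in = tokens.permute(0, 2, 1, 3).reshape(b * o, t, d)
        temporal_out, _ = self.temporal_attn(temporal_in, temporal_in, temporal_in)
        temporal_out = temporal_out.reshape(b, o, t, d).permute(0, 2, 1, 3)

        # Inference-time mixing of the two independently learned attentions.
        return self.alpha * spatial_out + (1.0 - self.alpha) * temporal_out


if __name__ == "__main__":
    x = torch.randn(1, 8, 3, 64)  # 1 video, 8 frames, 3 object tokens, 64-dim
    mixed = SpatioTemporalMixer(dim=64)(x)
    print(mixed.shape)  # torch.Size([1, 8, 3, 64])
```

The key point the sketch tries to capture is that neither attention module ever sees 4D compositional data: one only ever attends across objects within a frame, the other only across time for a single object, and composition over time emerges from combining them at inference.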