
DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling

December 2, 2025
作者: Kairun Wen, Yuzhi Huang, Runyu Chen, Hui Zheng, Yunlong Lin, Panwang Pan, Chenxin Li, Wenyan Cong, Jian Zhang, Junbin Lu, Chenguo Lin, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Yue Huang, Xinghao Ding, Rakesh Ranjan, Zhiwen Fan
cs.AI

Abstract

Understanding the dynamic physical world, characterized by its evolving 3D structure, real-world motion, and semantic content with textual descriptions, is crucial for human-agent interaction and enables embodied agents to perceive and act within real environments with human-like capabilities. However, existing datasets are often derived from limited simulators or rely on traditional Structure-from-Motion for up-to-scale annotation, and offer limited descriptive captioning, which restricts the capacity of foundation models to accurately interpret real-world dynamics from monocular videos, commonly sourced from the internet. To bridge these gaps, we introduce DynamicVerse, a physical-scale, multimodal 4D world modeling framework for dynamic real-world video. We employ large vision, geometric, and multimodal models to interpret metric-scale static geometry, real-world dynamic motion, instance-level masks, and holistic descriptive captions. By integrating window-based Bundle Adjustment with global optimization, our method converts long real-world video sequences into a comprehensive 4D multimodal format. DynamicVerse delivers a large-scale dataset consisting of 100K+ videos with 800K+ annotated masks and 10M+ frames from internet videos. Experimental evaluations on three benchmark tasks, namely video depth estimation, camera pose estimation, and camera intrinsics estimation, demonstrate that our 4D modeling achieves superior performance in capturing physical-scale measurements with greater global accuracy than existing methods.
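The abstract mentions converting long videos into 4D format by combining window-based Bundle Adjustment with global optimization. As an illustrative sketch only, and not the authors' pipeline, the two ingredients of that idea can be pictured as (a) chunking a long frame sequence into overlapping windows, and (b) reconciling per-window estimates that are each consistent only up to an unknown scale by chaining scale ratios over the overlaps. The function names (`make_windows`, `chain_window_scales`) and the toy depth model are hypothetical:

```python
import numpy as np

def make_windows(num_frames, window_size, stride):
    """Split frame indices 0..num_frames-1 into overlapping windows;
    the final window is clipped at the last frame."""
    windows, start = [], 0
    while True:
        end = min(start + window_size, num_frames)
        windows.append(list(range(start, end)))
        if end == num_frames:
            return windows
        start += stride

def chain_window_scales(per_window_depth, windows):
    """Each window's depths are assumed consistent only up to an unknown
    scale. Chain median depth ratios over overlapping frames so every
    window is brought to the scale of window 0."""
    scales = [1.0]
    for i in range(1, len(windows)):
        overlap = sorted(set(windows[i - 1]) & set(windows[i]))
        # ratio of the two windows' depths on shared frames = relative scale
        ratios = [per_window_depth[i - 1][f] / per_window_depth[i][f]
                  for f in overlap]
        scales.append(scales[-1] * float(np.median(ratios)))
    return scales

if __name__ == "__main__":
    # Toy example: true per-frame depth, observed by 3 windows whose
    # internal scales drift (1.0x, 0.5x, 2.0x).
    windows = make_windows(num_frames=10, window_size=5, stride=3)
    d = np.linspace(1.0, 2.0, 10)
    true_scales = [1.0, 0.5, 2.0]
    per_window_depth = [{f: d[f] / s for f in w}
                        for w, s in zip(windows, true_scales)]
    print(chain_window_scales(per_window_depth, windows))  # recovers the drift
```

In a real system each window would be solved by Bundle Adjustment (poses, intrinsics, and depth jointly) and the cross-window reconciliation would be a full global optimization rather than a one-dimensional scale chain; the sketch only shows why overlapping windows make a globally consistent solution recoverable from local ones.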
December 6, 2025