
How Much 3D Do Video Foundation Models Encode?

December 23, 2025
Authors: Zixuan Huang, Xiang Li, Zhaoyang Lv, James M. Rehg
cs.AI

Abstract

Videos are continuous 2D projections of 3D worlds. After training on large video data, will global 3D understanding naturally emerge? We study this by quantifying the 3D understanding of existing Video Foundation Models (VidFMs) pretrained on vast video data. We propose the first model-agnostic framework that measures the 3D awareness of various VidFMs by estimating multiple 3D properties from their features via shallow read-outs. Our study presents meaningful findings regarding the 3D awareness of VidFMs on multiple axes. In particular, we show that state-of-the-art video generation models exhibit a strong understanding of 3D objects and scenes, despite not being trained on any 3D data. Such understanding can even surpass that of large expert models specifically trained for 3D tasks. Our findings, together with the 3D benchmarking of major VidFMs, provide valuable observations for building scalable 3D models.
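The abstract's core measurement idea — fitting "shallow read-outs" on frozen VidFM features to estimate 3D properties — can be illustrated with a minimal sketch. The code below is not the paper's implementation: the feature extractor is mocked with synthetic data, and the probed property (per-pixel depth), the shapes, and the ridge-regularized linear head are all illustrative assumptions. It shows only the general probing recipe: freeze the features, fit a small head, and read decodability off the held-in fit quality.

```python
import numpy as np

# Hedged sketch of a "shallow read-out" probe: fit a linear head on frozen
# features to predict a 3D property (here, per-pixel depth). An actual
# VidFM feature extractor is replaced by synthetic data; all names and
# shapes are illustrative assumptions, not the paper's setup.

rng = np.random.default_rng(0)

N, D = 2048, 64                      # N feature vectors (pixels/patches), D channels
w_true = rng.normal(size=D)          # hypothetical linear depth signal in the features

feats = rng.normal(size=(N, D))                      # mocked frozen VidFM features
depth = feats @ w_true + 0.1 * rng.normal(size=N)    # synthetic depth targets

# Ridge-regularized linear read-out: w = (F^T F + lam * I)^{-1} F^T y
lam = 1e-2
w = np.linalg.solve(feats.T @ feats + lam * np.eye(D), feats.T @ depth)

pred = feats @ w
r2 = 1.0 - np.sum((depth - pred) ** 2) / np.sum((depth - depth.mean()) ** 2)
print(f"probe R^2: {r2:.3f}")
```

A high R^2 here means depth is linearly decodable from the features; in the paper's framework, comparing such read-out scores across models (and across 3D properties) is what quantifies each VidFM's 3D awareness without fine-tuning the backbone.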