

How Much 3D Do Video Foundation Models Encode?

December 23, 2025
Authors: Zixuan Huang, Xiang Li, Zhaoyang Lv, James M. Rehg
cs.AI

Abstract

Videos are continuous 2D projections of 3D worlds. After training on large video data, will global 3D understanding naturally emerge? We study this by quantifying the 3D understanding of existing Video Foundation Models (VidFMs) pretrained on vast video data. We propose the first model-agnostic framework that measures the 3D awareness of various VidFMs by estimating multiple 3D properties from their features via shallow read-outs. Our study presents meaningful findings regarding the 3D awareness of VidFMs on multiple axes. In particular, we show that state-of-the-art video generation models exhibit a strong understanding of 3D objects and scenes, despite not being trained on any 3D data. Such understanding can even surpass that of large expert models specifically trained for 3D tasks. Our findings, together with the 3D benchmarking of major VidFMs, provide valuable observations for building scalable 3D models.
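The abstract does not specify how the shallow read-outs are implemented. As a minimal, hypothetical sketch of the idea, one can fit a small linear probe (here, ridge regression with NumPy) that maps frozen VidFM features to a 3D property such as per-patch depth; the feature dimensions, regularization strength, and the synthetic data below are all illustrative assumptions, not details from the paper.

```python
import numpy as np

def fit_linear_readout(features, targets, l2=1e-3):
    """Fit a ridge-regression read-out mapping frozen features (N, D)
    to a target 3D property (N,), e.g. per-patch depth. Returns weights
    of shape (D + 1,) including a bias term."""
    n, d = features.shape
    x = np.hstack([features, np.ones((n, 1))])  # append bias column
    a = x.T @ x + l2 * np.eye(d + 1)            # regularized normal equations
    return np.linalg.solve(a, x.T @ targets)

def apply_readout(features, w):
    """Predict the 3D property for new frozen features."""
    n = features.shape[0]
    x = np.hstack([features, np.ones((n, 1))])
    return x @ w

# Toy demo: synthetic "frozen features" that linearly encode depth plus noise.
rng = np.random.default_rng(0)
feats = rng.normal(size=(512, 64))
true_w = rng.normal(size=64)
depth = feats @ true_w + 0.01 * rng.normal(size=512)

w = fit_linear_readout(feats[:400], depth[:400])   # train on 400 samples
pred = apply_readout(feats[400:], w)               # probe held-out samples
mse = np.mean((pred - depth[400:]) ** 2)
```

A low held-out error indicates the property is linearly decodable from the frozen features, which is the sense in which a shallow read-out "measures" 3D awareness without fine-tuning the backbone.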