Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

Abstract

Videos are inherently 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometry-aware structure in their learned representations. To bridge this gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize latent 3D representations. Our key insight is to guide the model's intermediate representations toward geometry-aware structure by aligning them with features from a pretrained geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing unnormalized geometric features from normalized diffusion representations. We evaluate Geometry Forcing on both camera view-conditioned and action-conditioned video generation tasks. Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over the baseline methods. Project page: https://GeometryForcing.github.io.
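
The two alignment objectives can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that the diffusion and geometric features are already per-token vectors of equal dimension, that Angular Alignment is one minus the mean cosine similarity, and that Scale Alignment is a mean squared error on the output of a linear head. The function names and the `weight` and `bias` parameters are hypothetical.

```python
import math


def l2_normalize(v, eps=1e-8):
    """Scale a vector to (approximately) unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]


def angular_alignment_loss(diff_feats, geo_feats):
    """Assumed form: 1 - mean cosine similarity over tokens."""
    total = 0.0
    for d, g in zip(diff_feats, geo_feats):
        d_hat, g_hat = l2_normalize(d), l2_normalize(g)
        total += sum(a * b for a, b in zip(d_hat, g_hat))
    return 1.0 - total / len(diff_feats)


def scale_alignment_loss(diff_feats, geo_feats, weight, bias):
    """Assumed form: MSE between a linear prediction made from the
    normalized diffusion features and the unnormalized geometric features."""
    total, count = 0.0, 0
    for d, g in zip(diff_feats, geo_feats):
        d_hat = l2_normalize(d)
        # Hypothetical linear head: pred = weight @ d_hat + bias
        pred = [sum(w * x for w, x in zip(row, d_hat)) + b
                for row, b in zip(weight, bias)]
        total += sum((p - t) ** 2 for p, t in zip(pred, g))
        count += len(g)
    return total / count


# Toy features: same directions, different magnitudes.
diff_feats = [[1.0, 0.0], [0.0, 2.0]]
geo_feats = [[3.0, 0.0], [0.0, 5.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_bias = [0.0, 0.0]

angular = angular_alignment_loss(diff_feats, geo_feats)
scale = scale_alignment_loss(diff_feats, geo_feats, identity, zero_bias)
```

In this toy case the angular term is close to zero because the directions already match. The scale term stays large because the untrained (identity) head cannot yet reproduce the magnitudes of the geometric features. This shows why the abstract calls the two objectives complementary.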