Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

Abstract

Videos are inherently 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometry-aware structure in their learned representations. To bridge this gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize latent 3D representations. Our key insight is to guide the model's intermediate representations toward geometry-aware structure by aligning them with features from a pretrained geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing unnormalized geometric features from normalized diffusion representations. We evaluate Geometry Forcing on both camera-view-conditioned and action-conditioned video generation tasks. Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over the baseline methods. Project page: https://GeometryForcing.github.io.
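The two alignment objectives can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the linear regression head (`weight`, `bias`), the mean-squared-error form of the scale term, and the assumption that diffusion and geometric features are already matched token by token with the same dimension are all hypothetical choices made for illustration. The abstract specifies only that Angular Alignment uses cosine similarity and that Scale Alignment regresses unnormalized geometric features from normalized diffusion representations.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each feature vector (last axis) to approximately unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def angular_alignment_loss(diff_feats, geo_feats):
    """Mean (1 - cosine similarity) over matched feature vectors.

    Depends only on direction, so feature magnitude is ignored.
    """
    cos = np.sum(l2_normalize(diff_feats) * l2_normalize(geo_feats), axis=-1)
    return float(np.mean(1.0 - cos))

def scale_alignment_loss(diff_feats, geo_feats, weight, bias):
    """MSE between a prediction made from *normalized* diffusion features
    and the *unnormalized* geometric features.

    The linear head (weight, bias) is an assumption for this sketch.
    """
    pred = l2_normalize(diff_feats) @ weight + bias
    return float(np.mean((pred - geo_feats) ** 2))

# Toy data: geometric features share direction with the diffusion
# features but differ in magnitude by a factor of 3.
rng = np.random.default_rng(0)
num_tokens, dim = 16, 8
diff_feats = rng.normal(size=(num_tokens, dim))
geo_feats = 3.0 * diff_feats

# Hypothetical, untrained identity head.
weight = np.eye(dim)
bias = np.zeros(dim)

loss_angular = angular_alignment_loss(diff_feats, geo_feats)
loss_scale = scale_alignment_loss(diff_feats, geo_feats, weight, bias)
print("angular:", loss_angular, "scale:", loss_scale)
```

In this toy setup the angular term is close to zero because the two feature sets point in the same directions, while the scale term stays positive because the magnitudes differ. That is the sense in which the two objectives are complementary: cosine similarity alone discards the scale information that the regression term is meant to retain.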
