Quantitative Video World Model Evaluation for Geometric Consistency

Abstract

Generative video models are increasingly studied as implicit world models, yet evaluating whether they produce physically plausible 3D structure and motion remains challenging. Most existing video evaluation pipelines rely heavily on human judgment or learned graders, which can be subjective and only weakly diagnostic of geometric failures. We introduce PDI-Bench (Perspective Distortion Index), a quantitative framework for auditing geometric coherence in generated videos. Given a generated clip, we obtain object-centric observations via segmentation and point tracking (e.g., SAM 2, MegaSaM, and CoTracker3), lift them to 3D world-space coordinates via monocular reconstruction, and compute a set of projective-geometry residuals capturing three failure dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. To support systematic evaluation, we build PDI-Dataset, which covers diverse scenarios designed to stress these geometric constraints. Across state-of-the-art video generators, PDI reveals consistent geometry-specific failure modes that common perceptual metrics do not capture, and it provides a diagnostic signal for progress toward physically grounded video generation and physical world models. Our code and dataset are available at https://pdi-bench.github.io/.
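To make the idea of a geometric residual concrete, the following is a minimal sketch of two such checks, assuming tracked points have already been lifted to 3D. It is not the PDI definition: the abstract does not give the formulas, so the function names (`rigidity_residual`, `scale_depth_residual`), the input layouts, and the specific normalizations are all assumptions made for illustration. The first check uses the fact that pairwise distances on a rigid object stay constant over time; the second uses the pinhole relation that apparent image size is inversely proportional to depth, so their product should be constant.

```python
import numpy as np


def rigidity_residual(points):
    """Hypothetical rigidity check (not the PDI formula).

    points: (T, N, 3) array of N tracked 3D points over T frames.
    Returns the mean relative deviation of each pairwise distance
    from its temporal median; 0 for a perfectly rigid object.
    """
    diffs = points[:, :, None, :] - points[:, None, :, :]  # (T, N, N, 3)
    dists = np.linalg.norm(diffs, axis=-1)                 # (T, N, N)
    rows, cols = np.triu_indices(points.shape[1], k=1)     # unique pairs only
    pair_d = dists[:, rows, cols]                          # (T, P)
    ref = np.median(pair_d, axis=0)                        # (P,)
    return float(np.mean(np.abs(pair_d - ref) / (ref + 1e-8)))


def scale_depth_residual(image_size, depth):
    """Hypothetical scale-depth check (not the PDI formula).

    image_size: (T,) apparent object size in pixels per frame.
    depth:      (T,) estimated object depth per frame.
    Under a pinhole camera, image_size * depth is constant for an
    object of fixed physical size, so the standard deviation of
    log(image_size * depth) measures the violation.
    """
    log_product = np.log(image_size) + np.log(depth)
    return float(np.std(log_product))
```

Both functions return a value near zero for geometrically consistent input and grow as the constraint is violated, for example when an object keeps the same pixel size while its estimated depth doubles. A motion-consistency check would need additional assumptions about the expected trajectory and is omitted here.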