Quantitative Video World Model Evaluation for Geometric Consistency

Abstract

Generative video models are increasingly studied as implicit world models, yet evaluating whether they produce physically plausible 3D structure and motion remains challenging. Most existing video evaluation pipelines rely heavily on human judgment or learned graders, which can be subjective and are only weakly diagnostic of geometric failures. We introduce PDI-Bench (Perspective Distortion Index), a quantitative framework for auditing geometric coherence in generated videos. Given a generated clip, we obtain object-centric observations via segmentation and point tracking (e.g., SAM 2, MegaSaM, and CoTracker3), lift them to 3D world-space coordinates via monocular reconstruction, and compute a set of projective-geometry residuals capturing three failure dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. To support systematic evaluation, we build PDI-Dataset, which covers diverse scenarios designed to stress these geometric constraints. Across state-of-the-art video generators, PDI reveals consistent geometry-specific failure modes that are not captured by common perceptual metrics, and provides a diagnostic signal for progress toward physically grounded video generation and physical world models. Our code and dataset can be found at https://pdi-bench.github.io/.
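To make two of the three failure dimensions concrete, the following is a minimal illustrative sketch and not the PDI-Bench implementation; the abstract does not give the exact residual definitions. It assumes an ideal pinhole camera with fixed focal length, under which an object's apparent size times its depth stays constant, and a rigid object, whose pairwise 3D point distances stay constant over time. The function names `scale_depth_residual` and `rigidity_residual`, the input layouts, and the choice of normalization are all hypothetical.

```python
import numpy as np


def scale_depth_residual(sizes_px, depths):
    """Spread of log(size * depth) across frames (hypothetical residual).

    Under a pinhole model with fixed focal length, size_px = f * S / Z,
    so size_px * Z is constant for a rigid object and the result is ~0.
    """
    sizes_px = np.asarray(sizes_px, dtype=float)
    depths = np.asarray(depths, dtype=float)
    log_product = np.log(sizes_px) + np.log(depths)
    return float(np.std(log_product))


def rigidity_residual(points):
    """Mean relative deviation of pairwise 3D distances (hypothetical residual).

    points: array of shape (T, N, 3), N tracked points over T frames.
    Each pair's distance is compared with its median over time; a rigid
    object yields ~0 regardless of how it translates or rotates.
    """
    points = np.asarray(points, dtype=float)
    diffs = points[:, :, None, :] - points[:, None, :, :]  # (T, N, N, 3)
    dists = np.linalg.norm(diffs, axis=-1)                 # (T, N, N)
    rows, cols = np.triu_indices(points.shape[1], k=1)     # unique pairs
    pair_dists = dists[:, rows, cols]                      # (T, P)
    reference = np.median(pair_dists, axis=0)              # (P,)
    return float(np.mean(np.abs(pair_dists - reference) / reference))


if __name__ == "__main__":
    depths = [2.0, 4.0, 8.0]
    # Apparent size halves as depth doubles: consistent with perspective.
    print(scale_depth_residual([100.0, 50.0, 25.0], depths))
    # Apparent size stays fixed while depth grows: inconsistent.
    print(scale_depth_residual([100.0, 100.0, 100.0], depths))

    base = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
    rigid = np.stack([base + np.array([0.0, 0.0, t]) for t in range(4)])
    stretched = np.stack([base * (1.0 + 0.5 * t) for t in range(4)])
    print(rigidity_residual(rigid), rigidity_residual(stretched))
```

Both quantities are dimensionless, so in this sketch they could be compared across clips of different scale; how the actual benchmark normalizes and aggregates its residuals is not specified in the abstract.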