DreamWorld: Unified World Modeling in Video Generation

Abstract

Despite impressive progress in video generation, existing models remain limited to surface-level plausibility and lack a coherent, unified understanding of the world. Prior approaches typically incorporate only a single form of world-related knowledge or rely on rigid alignment strategies to introduce additional knowledge. However, aligning a single form of world knowledge is insufficient for building a world model, which requires jointly modeling multiple heterogeneous dimensions (e.g., physical commonsense, 3D consistency, and temporal consistency). To address this limitation, we introduce DreamWorld, a unified framework that integrates complementary world knowledge into video generators via a Joint World Modeling Paradigm, which jointly predicts video pixels and features from foundation models to capture temporal dynamics, spatial geometry, and semantic consistency. However, naively optimizing these heterogeneous objectives can lead to visual instability and temporal flickering. To mitigate this issue, we propose Consistent Constraint Annealing (CCA), which progressively regulates world-level constraints during training, and Multi-Source Inner-Guidance, which enforces learned world priors at inference. Extensive evaluations show that DreamWorld improves world consistency, outperforming Wan2.1 by 2.26 points on VBench. Code will be made publicly available at https://github.com/ABU121111/DreamWorld.
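As a rough illustration of how a pixel objective might be combined with annealed feature-level constraints, the sketch below weights auxiliary feature losses with a schedule that changes over training. The abstract does not specify the schedule, the loss terms, or their weights, so everything here is an assumption made for illustration: the cosine decay, the names `annealed_weight` and `joint_loss`, and the keys of the example loss dictionary are all hypothetical and are not taken from the DreamWorld code.

```python
import math


def annealed_weight(step, total_steps, w_start=1.0, w_end=0.0):
    """Hypothetical cosine schedule moving a constraint weight from w_start to w_end.

    The actual CCA schedule is not given in the abstract; cosine decay is an
    assumption chosen only because it is a common, smooth annealing curve.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp training progress to [0, 1].
    progress = min(max(step / total_steps, 0.0), 1.0)
    # Cosine factor goes from 1 (start of training) to 0 (end of training).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return w_end + (w_start - w_end) * cosine


def joint_loss(pixel_loss, feature_losses, step, total_steps):
    """Pixel loss plus annealed sum of auxiliary feature-prediction losses.

    feature_losses maps a (hypothetical) knowledge source name to its scalar loss.
    """
    weight = annealed_weight(step, total_steps)
    return pixel_loss + weight * sum(feature_losses.values())


if __name__ == "__main__":
    # Illustrative values only; these are not results from the paper.
    losses = {"temporal": 0.5, "geometry": 0.25, "semantic": 0.25}
    for step in (0, 500, 1000):
        print(step, joint_loss(1.0, losses, step, total_steps=1000))
```

Under this assumed schedule the auxiliary terms dominate early in training and fade toward the end, leaving the pixel objective to set the final visual quality. Whether CCA decays, increases, or otherwise reshapes the constraints is something only the paper body or released code can confirm.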
