UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation
December 8, 2025
Authors: Jiehui Huang, Yuechen Zhang, Xu He, Yuan Gao, Zhi Cen, Bin Xia, Yan Zhou, Xin Tao, Pengfei Wan, Jiaya Jia
cs.AI
Abstract
Recent video generation models demonstrate impressive synthesis capabilities but remain limited by single-modality conditioning, constraining their holistic world understanding. This stems from insufficient cross-modal interaction and limited modal diversity for comprehensive world knowledge representation. To address these limitations, we introduce UnityVideo, a unified framework for world-aware video generation that jointly learns across multiple modalities (segmentation masks, human skeletons, DensePose, optical flow, and depth maps) and training paradigms. Our approach features two core components: (1) dynamic noising to unify heterogeneous training paradigms, and (2) a modality switcher with an in-context learner that enables unified processing via modular parameters and contextual learning. We contribute a large-scale unified dataset with 1.3M samples. Through joint optimization, UnityVideo accelerates convergence and significantly enhances zero-shot generalization to unseen data. We demonstrate that UnityVideo achieves superior video quality, consistency, and improved alignment with physical world constraints. Code and data can be found at: https://github.com/dvlab-research/UnityVideo
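The abstract mentions two mechanisms: dynamic noising to unify heterogeneous training paradigms, and a modality switcher for unified multi-modal conditioning. The sketch below is only an illustrative interpretation of those ideas, not the authors' implementation; the names (`ModalitySwitcher`, `dynamic_noising`, `MODALITIES`, the linear noising schedule, and the paradigm-sampling probability) are all hypothetical assumptions for the example.

```python
# Illustrative sketch (not UnityVideo's actual code): a toy "modality switcher"
# that tags condition tokens with a learned per-modality embedding, and a toy
# "dynamic noising" routine that randomly picks which stream gets noised,
# covering both a generation paradigm (condition -> video) and an estimation
# paradigm (video -> condition) with one model.
import torch
import torch.nn as nn

# Hypothetical list of condition modalities, following those named in the abstract.
MODALITIES = ["segmentation", "skeleton", "densepose", "optical_flow", "depth"]


class ModalitySwitcher(nn.Module):
    """Adds a learned modality embedding so one backbone can process all condition types."""

    def __init__(self, dim: int):
        super().__init__()
        self.embed = nn.Embedding(len(MODALITIES), dim)

    def forward(self, cond_tokens: torch.Tensor, modality: str) -> torch.Tensor:
        idx = torch.tensor(MODALITIES.index(modality), device=cond_tokens.device)
        # Broadcast the (dim,) embedding over (batch, time, dim) condition tokens.
        return cond_tokens + self.embed(idx)


def dynamic_noising(video_tokens, cond_tokens, t, p_estimation=0.3):
    """Sample a training paradigm per step (assumed probability p_estimation):
    - generation: noise the video tokens, keep the condition clean
    - estimation: noise the condition tokens, keep the video clean
    `t` is a diffusion timestep in [0, 1]; linear interpolation toward Gaussian
    noise is used here purely for illustration."""

    def noise(x):
        eps = torch.randn_like(x)
        return (1 - t) * x + t * eps

    if torch.rand(()) < p_estimation:
        return video_tokens, noise(cond_tokens)   # estimation paradigm
    return noise(video_tokens), cond_tokens       # generation paradigm


if __name__ == "__main__":
    B, T, D = 2, 16, 64
    video = torch.randn(B, T, D)
    cond = torch.randn(B, T, D)
    cond = ModalitySwitcher(D)(cond, "depth")
    v_in, c_in = dynamic_noising(video, cond, t=0.7)
    print(v_in.shape, c_in.shape)
```

In this reading, sharing one backbone across paradigms simply means deciding at each training step which stream carries noise, while the modality embedding tells the model which condition type it is seeing; the paper's actual mechanism may differ.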