UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation

December 8, 2025
Authors: Jiehui Huang, Yuechen Zhang, Xu He, Yuan Gao, Zhi Cen, Bin Xia, Yan Zhou, Xin Tao, Pengfei Wan, Jiaya Jia
cs.AI

Abstract

Recent video generation models demonstrate impressive synthesis capabilities but remain limited by single-modality conditioning, constraining their holistic world understanding. This stems from insufficient cross-modal interaction and limited modal diversity for comprehensive world knowledge representation. To address these limitations, we introduce UnityVideo, a unified framework for world-aware video generation that jointly learns across multiple modalities (segmentation masks, human skeletons, DensePose, optical flow, and depth maps) and training paradigms. Our approach features two core components: (1) dynamic noising to unify heterogeneous training paradigms, and (2) a modality switcher with an in-context learner that enables unified processing via modular parameters and contextual learning. We contribute a large-scale unified dataset with 1.3M samples. Through joint optimization, UnityVideo accelerates convergence and significantly enhances zero-shot generalization to unseen data. We demonstrate that UnityVideo achieves superior video quality, consistency, and improved alignment with physical world constraints. Code and data can be found at: https://github.com/dvlab-research/UnityVideo
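The abstract names two core components: dynamic noising to unify heterogeneous training paradigms, and a modality switcher with modular per-modality parameters. As a rough illustration only, here is a minimal PyTorch sketch of what such components could look like. Every name in it (`DynamicNoiser`, `ModalitySwitcher`, the per-paradigm noise ranges, the modality list as adapter keys) is a hypothetical assumption for exposition, not the paper's actual implementation.

```python
# Minimal, hypothetical sketch of "dynamic noising" and a "modality switcher".
# Names and the paradigm-dependent noise rule are assumptions, not the paper's code.
import torch
import torch.nn as nn

MODALITIES = ["seg_mask", "skeleton", "densepose", "optical_flow", "depth"]


class DynamicNoiser(nn.Module):
    """Hypothetical dynamic noise injection: each training paradigm draws its
    noise level t from a different range, so heterogeneous objectives (e.g.,
    pure generation vs. condition-guided denoising) can share one loss."""

    def forward(self, x, paradigm):
        if paradigm == "generation":
            t = torch.rand(x.size(0), device=x.device)        # t ~ U(0, 1): full noise range
        else:
            t = 0.5 * torch.rand(x.size(0), device=x.device)  # t ~ U(0, 0.5): lighter noise (assumed rule)
        noise = torch.randn_like(x)
        t_ = t.view(-1, *([1] * (x.dim() - 1)))               # broadcast t over token/feature dims
        noisy = (1.0 - t_) * x + t_ * noise                   # linear interpolation toward noise
        return noisy, t, noise


class ModalitySwitcher(nn.Module):
    """Hypothetical modality switcher: routes each conditioning modality
    through its own small adapter (modular parameters) before a shared
    backbone consumes the tokens."""

    def __init__(self, dim):
        super().__init__()
        self.adapters = nn.ModuleDict({m: nn.Linear(dim, dim) for m in MODALITIES})

    def forward(self, cond_tokens, modality):
        return self.adapters[modality](cond_tokens)


# Toy usage: noise a batch of latent video tokens and adapt a depth condition.
x = torch.randn(2, 16, 64)                                    # (batch, tokens, dim) stand-in for video latents
noisy, t, noise = DynamicNoiser()(x, "generation")
cond = ModalitySwitcher(dim=64)(torch.randn(2, 16, 64), "depth")
```

The only moving part in this sketch is the paradigm-dependent noise range; the point is simply that a single noising interface and per-modality adapters could, in principle, let several training setups and condition types share one backbone, which is the unification the abstract describes.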