UniVideo: Unified Understanding, Generation, and Editing for Videos

October 9, 2025
Authors: Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhu Chen
cs.AI

Abstract

Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we will release our model and code.
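To make the dual-stream design concrete, below is a minimal sketch of how an MLLM understanding stream and an MMDiT generation stream could be wired together, as the abstract describes. All class, method, and argument names here (UniVideoSketch, cond, visual_context, etc.) are hypothetical illustrations, not the authors' actual API; the real interfaces will be defined by the promised model and code release.

```python
# Hedged sketch of UniVideo's dual-stream design, inferred from the abstract.
# Module names and signatures are hypothetical, not the authors' implementation.
import torch.nn as nn


class UniVideoSketch(nn.Module):
    """Dual-stream model: an MLLM interprets the multimodal instruction,
    and an MMDiT generates video latents conditioned on that interpretation
    plus the raw visual inputs (to preserve visual consistency)."""

    def __init__(self, mllm: nn.Module, mmdit: nn.Module):
        super().__init__()
        self.mllm = mllm    # understanding stream: text + image/video tokens -> instruction embedding
        self.mmdit = mmdit  # generation stream: diffusion transformer over video latents

    def forward(self, text_tokens, visual_inputs, noisy_video_latents, timestep):
        # Stream 1: the MLLM parses the multimodal instruction, e.g.
        # "restyle this clip to match the reference image".
        instruction_emb = self.mllm(text_tokens, visual_inputs)

        # Stream 2: the MMDiT denoises video latents, conditioned on both the
        # MLLM's interpretation and the visual inputs themselves, so that
        # fine-grained appearance details are not lost in the language model.
        return self.mmdit(
            noisy_video_latents,
            timestep,
            cond=instruction_emb,
            visual_context=visual_inputs,
        )
```

Feeding the visual inputs to both streams is the key point of the design sketched above: the MLLM stream supplies instruction understanding, while the direct visual path to the MMDiT preserves identity and appearance consistency across generation and editing tasks.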