
UniVideo: Unified Understanding, Generation, and Editing for Videos

October 9, 2025
Authors: Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhu Chen
cs.AI

Abstract

Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we will release our model and code.
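The dual-stream design described above can be made concrete with a small sketch: an instruction-understanding stream (standing in for the MLLM) encodes the multimodal instruction, and its output conditions a DiT-style generation stream (standing in for the MMDiT) through cross-attention. The abstract does not specify architectural details, so every module name, dimension, and the single-block structure below is a hypothetical assumption for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class DualStreamSketch(nn.Module):
    """Conceptual sketch of a dual-stream layout: an 'MLLM' stream encodes
    the multimodal instruction, and an 'MMDiT'-style stream denoises video
    latents conditioned on those instruction embeddings. All names and
    hyperparameters here are illustrative assumptions."""

    def __init__(self, dim: int = 256, n_heads: int = 4, depth: int = 2):
        super().__init__()
        # Instruction-understanding stream (stand-in for a real MLLM).
        self.instruction_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, n_heads, batch_first=True),
            num_layers=depth,
        )
        # Generation stream (stand-in for one MMDiT block): self-attention
        # over video tokens plus cross-attention to instruction embeddings.
        self.video_self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def forward(self, video_latents: torch.Tensor,
                instruction_tokens: torch.Tensor) -> torch.Tensor:
        # video_latents: (B, T*H*W, dim) noisy video-token latents.
        # instruction_tokens: (B, L, dim) embedded text/image instruction.
        cond = self.instruction_encoder(instruction_tokens)
        x, _ = self.video_self_attn(video_latents, video_latents, video_latents)
        # The instruction stream guides the generation stream here.
        x, _ = self.cross_attn(x, cond, cond)
        return self.out(x)  # denoiser prediction for one diffusion step


# Toy usage: one conditioned denoising call on random latents.
model = DualStreamSketch()
latents = torch.randn(1, 64, 256)      # e.g. a few frames' worth of patches
instruction = torch.randn(1, 10, 256)  # embedded multimodal instruction
pred = model(latents, instruction)
print(pred.shape)  # torch.Size([1, 64, 256])
```

In this reading, instruction semantics are resolved once by the MLLM-side stream and injected repeatedly into the denoiser, which is consistent with the abstract's claim that a single multimodal instruction format can drive generation, editing, and their compositions.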