UniVideo: Unified Understanding, Generation, and Editing for Videos

Abstract

Unified multimodal models have shown promising results in multimodal content generation and editing, but they remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design that combines a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation, and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we will release our model and code.
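The dual-stream idea, in which one component interprets the instruction and another synthesizes the video from that interpretation, can be pictured with a toy sketch. Everything below is a hypothetical illustration: the `Instruction` container, the `understand` and `synthesize` functions, and the file names are assumptions made for this example and are not UniVideo's released code or API. Passing the visual inputs to the generation step as well is likewise an assumption about how visual consistency might be preserved.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Instruction:
    """Hypothetical instruction container; not UniVideo's actual input format."""
    text: str
    images: List[str] = field(default_factory=list)  # illustrative file names
    video: Optional[str] = None                      # illustrative file name


def understand(instr: Instruction) -> Dict[str, object]:
    """Stand-in for the MLLM stream: summarize what the instruction contains."""
    return {
        "prompt": instr.text.strip(),
        "num_images": len(instr.images),
        "has_video": instr.video is not None,
    }


def synthesize(cond: Dict[str, object], instr: Instruction) -> str:
    """Stand-in for the MMDiT stream: describe the output instead of rendering it."""
    # Assumption: the generator sees the raw visual inputs as well as the conditioning.
    visual = list(instr.images) + ([instr.video] if instr.video else [])
    sources = ", ".join(visual) if visual else "no visual inputs"
    return f"video for '{cond['prompt']}' conditioned on {sources}"


def run(instr: Instruction) -> str:
    """Single entry point shared by every task (generation or editing)."""
    return synthesize(understand(instr), instr)


if __name__ == "__main__":
    print(run(Instruction("a cat surfing at sunset")))
    print(run(Instruction("make the jacket leather", video="clip.mp4")))
```

The point of the sketch is only that text-to-video, image-conditioned generation, and editing all pass through the same `run` call, with the task implied by which inputs the instruction carries.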
