UniVideo: Unified Understanding, Generation, and Editing for Videos

Abstract

Unified multimodal models have shown promising results in multimodal content generation and editing, but they remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design that combines a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is trained jointly across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation, and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we will release our model and code.
