

Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator

April 9, 2026
作者: Luozheng Qin, Jia Gong, Qian Qiao, Tianjiao Li, Li Xu, Haoyu Pan, Chao Qu, Zhiyu Tan, Hao Li
cs.AI

Abstract

Unified multimodal models integrating visual understanding and generation face a fundamental challenge: visual generation incurs substantially higher computational costs than understanding, particularly for video. This imbalance motivates us to invert the conventional paradigm: rather than extending understanding-centric MLLMs to support generation, we propose Uni-ViGU, a framework that unifies video generation and understanding by extending a video generator as the foundation. We introduce a unified flow method that performs continuous flow matching for video and discrete flow matching for text within a single process, enabling coherent multimodal generation. We further propose a modality-driven MoE-based framework that augments Transformer blocks with lightweight layers for text generation while preserving generative priors. To repurpose generation knowledge for understanding, we design a bidirectional training mechanism with two stages: Knowledge Recall reconstructs input prompts to leverage learned text-video correspondences, while Capability Refinement fine-tunes on detailed captions to establish discriminative shared representations. Experiments demonstrate that Uni-ViGU achieves competitive performance on both video generation and understanding, validating generation-centric architectures as a scalable path toward unified multimodal intelligence. Project Page and Code: https://fr0zencrane.github.io/uni-vigu-page/.
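The abstract's "unified flow" idea — continuous flow matching on video latents and discrete flow matching on text tokens inside one training process — can be sketched as a single combined training step. The snippet below is an illustrative sketch only, not the paper's actual implementation: the model interface, the linear noise-to-data interpolation, and the mask-based text corruption schedule are all assumptions for exposition.

```python
# Hypothetical sketch of a unified flow-matching training step:
# continuous flow matching for video latents + discrete flow matching
# (mask-and-recover) for text, sharing one timestep t.
import torch
import torch.nn.functional as F


def unified_flow_step(model, video_latents, text_tokens, mask_id, vocab_size):
    B = video_latents.shape[0]
    t = torch.rand(B, device=video_latents.device)  # shared timestep in [0, 1)

    # Continuous flow matching: interpolate noise -> data along a straight
    # path and regress the constant velocity (data - noise).
    noise = torch.randn_like(video_latents)
    t_v = t.view(B, *([1] * (video_latents.dim() - 1)))
    x_t = (1 - t_v) * noise + t_v * video_latents
    target_velocity = video_latents - noise

    # Discrete flow matching: corrupt text by masking each token with
    # probability (1 - t), then train the model to recover the originals.
    keep = torch.rand_like(text_tokens, dtype=torch.float) < t.unsqueeze(1)
    noisy_tokens = torch.where(
        keep, text_tokens, torch.full_like(text_tokens, mask_id)
    )

    # Assumed model signature: returns a predicted velocity field for the
    # video latents and per-token logits for the text.
    pred_velocity, text_logits = model(x_t, noisy_tokens, t)

    video_loss = F.mse_loss(pred_velocity, target_velocity)
    text_loss = F.cross_entropy(
        text_logits.view(-1, vocab_size), text_tokens.view(-1)
    )
    return video_loss + text_loss
```

Because both modalities are denoised in one pass with a shared timestep, a single backbone can learn the text-video correspondences that the later Knowledge Recall stage reuses for understanding.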