

Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator

April 9, 2026
作者: Luozheng Qin, Jia Gong, Qian Qiao, Tianjiao Li, Li Xu, Haoyu Pan, Chao Qu, Zhiyu Tan, Hao Li
cs.AI

Abstract

Unified multimodal models integrating visual understanding and generation face a fundamental challenge: visual generation incurs substantially higher computational costs than understanding, particularly for video. This imbalance motivates us to invert the conventional paradigm: rather than extending understanding-centric MLLMs to support generation, we propose Uni-ViGU, a framework that unifies video generation and understanding by extending a video generator as the foundation. We introduce a unified flow method that performs continuous flow matching for video and discrete flow matching for text within a single process, enabling coherent multimodal generation. We further propose a modality-driven MoE-based framework that augments Transformer blocks with lightweight layers for text generation while preserving generative priors. To repurpose generation knowledge for understanding, we design a bidirectional training mechanism with two stages: Knowledge Recall reconstructs input prompts to leverage learned text-video correspondences, while Capability Refinement fine-tunes on detailed captions to establish discriminative shared representations. Experiments demonstrate that Uni-ViGU achieves competitive performance on both video generation and understanding, validating generation-centric architectures as a scalable path toward unified multimodal intelligence. Project Page and Code: https://fr0zencrane.github.io/uni-vigu-page/.
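The abstract's "unified flow" idea — continuous flow matching on video latents and discrete flow matching on text tokens inside one training process — can be sketched as a single combined training step. The snippet below is an illustrative sketch only, not the paper's actual implementation: the model interface, the linear noise-to-data interpolation, and the mask-based text corruption schedule are all assumptions for exposition.

```python
# Hypothetical sketch of a unified flow-matching training step:
# continuous flow matching for video latents + discrete flow matching
# (mask-and-recover) for text, sharing one timestep t.
import torch
import torch.nn.functional as F


def unified_flow_step(model, video_latents, text_tokens, mask_id, vocab_size):
    B = video_latents.shape[0]
    t = torch.rand(B, device=video_latents.device)  # shared timestep in [0, 1)

    # Continuous flow matching: interpolate noise -> data along a straight
    # path and regress the constant velocity (data - noise).
    noise = torch.randn_like(video_latents)
    t_v = t.view(B, *([1] * (video_latents.dim() - 1)))
    x_t = (1 - t_v) * noise + t_v * video_latents
    target_velocity = video_latents - noise

    # Discrete flow matching: corrupt text by masking each token with
    # probability (1 - t), then train the model to recover the originals.
    keep = torch.rand_like(text_tokens, dtype=torch.float) < t.unsqueeze(1)
    noisy_tokens = torch.where(
        keep, text_tokens, torch.full_like(text_tokens, mask_id)
    )

    # Assumed model signature: returns a predicted velocity field for the
    # video latents and per-token logits for the text.
    pred_velocity, text_logits = model(x_t, noisy_tokens, t)

    video_loss = F.mse_loss(pred_velocity, target_velocity)
    text_loss = F.cross_entropy(
        text_logits.view(-1, vocab_size), text_tokens.view(-1)
    )
    return video_loss + text_loss
```

Because both modalities are denoised in one pass with a shared timestep, a single backbone can learn the text-video correspondences that the later Knowledge Recall stage reuses for understanding.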