UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors
May 1, 2026
Authors: Houyuan Chen, Hong Li, Xianghao Kong, Tianrui Zhu, Shaocong Xu, Weiqing Xiao, Yuwei Guo, Chongjie Ye, Lvmin Zhang, Hao Zhao, Anyi Rao
cs.AI
Abstract
Recent progress has shown that video diffusion models (VDMs) can be repurposed for diverse multimodal graphics tasks. However, existing methods often train separate models for each problem setting, which fixes the input-output mapping and limits the modeling of correlations across modalities. We present UniVidX, a unified multimodal framework that leverages VDM priors for versatile video generation. UniVidX formulates pixel-aligned tasks as conditional generation in a shared multimodal space, adapts to modality-specific distributions while preserving the backbone's native priors, and promotes cross-modal consistency during synthesis. It is built on three key designs. Stochastic Condition Masking (SCM) randomly partitions modalities into clean conditions and noisy targets during training, enabling omni-directional conditional generation instead of fixed mappings. Decoupled Gated LoRA (DGL) introduces per-modality LoRAs that are activated when a modality serves as the generation target, preserving the strong priors of the VDM. Cross-Modal Self-Attention (CMSA) shares keys and values across modalities while keeping modality-specific queries, facilitating information exchange and inter-modal alignment. We instantiate UniVidX in two domains: UniVid-Intrinsic, for RGB videos and intrinsic maps including albedo, irradiance, and normals; and UniVid-Alpha, for blended RGB videos and their constituent RGBA layers. Experiments show that both models achieve performance competitive with state-of-the-art methods across distinct tasks and generalize robustly to in-the-wild scenarios, even when trained on fewer than 1,000 videos. Project page: https://houyuanchen111.github.io/UniVidX.github.io/
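The abstract names its three mechanisms only at a high level. The sketch below is a minimal, hypothetical PyTorch reading of SCM, DGL, and CMSA as described; all class and function names, shapes, gating details, and the masking distribution are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the three mechanisms named in the abstract.
# Names, shapes, and sampling choices are illustrative assumptions only.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F


def stochastic_condition_mask(num_modalities: int) -> list[bool]:
    """SCM: randomly mark each modality as a noisy target (True) or a clean
    condition (False), requiring at least one target per training sample."""
    while True:
        is_target = [random.random() < 0.5 for _ in range(num_modalities)]
        if any(is_target):
            return is_target


class DecoupledGatedLoRA(nn.Module):
    """DGL: a frozen backbone linear layer plus one LoRA branch per modality,
    activated only when that modality is a generation target."""

    def __init__(self, base: nn.Linear, num_modalities: int, rank: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # keep the VDM prior intact
        self.down = nn.ModuleList(
            [nn.Linear(base.in_features, rank, bias=False)
             for _ in range(num_modalities)])
        self.up = nn.ModuleList(
            [nn.Linear(rank, base.out_features, bias=False)
             for _ in range(num_modalities)])

    def forward(self, x: torch.Tensor, modality: int, is_target: bool) -> torch.Tensor:
        y = self.base(x)
        if is_target:  # gate: the LoRA branch fires only for generated modalities
            y = y + self.up[modality](self.down[modality](x))
        return y


class CrossModalSelfAttention(nn.Module):
    """CMSA: modality-specific queries attend over keys/values concatenated
    across all modalities, encouraging inter-modal alignment."""

    def __init__(self, dim: int, num_modalities: int, num_heads: int = 8):
        super().__init__()
        self.h, self.d = num_heads, dim // num_heads
        self.q_projs = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_modalities)])
        self.k_proj = nn.Linear(dim, dim)  # shared across modalities
        self.v_proj = nn.Linear(dim, dim)  # shared across modalities
        self.out = nn.Linear(dim, dim)

    def forward(self, tokens: list[torch.Tensor]) -> list[torch.Tensor]:
        # tokens[m]: (batch, seq_len_m, dim) for modality m.
        b = tokens[0].shape[0]
        kv = torch.cat(tokens, dim=1)  # pool all modalities' tokens
        k = self.k_proj(kv).view(b, -1, self.h, self.d).transpose(1, 2)
        v = self.v_proj(kv).view(b, -1, self.h, self.d).transpose(1, 2)
        outs = []
        for m, x in enumerate(tokens):
            q = self.q_projs[m](x).view(b, -1, self.h, self.d).transpose(1, 2)
            o = F.scaled_dot_product_attention(q, k, v)
            outs.append(self.out(o.transpose(1, 2).reshape(b, -1, self.h * self.d)))
        return outs
```

In this reading, SCM's mask determines both which latents receive noise during training and which per-modality LoRA branches fire in DGL, while CMSA's pooled keys and values let clean condition modalities steer the denoising of the noisy targets.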