
UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

May 1, 2026
作者: Houyuan Chen, Hong Li, Xianghao Kong, Tianrui Zhu, Shaocong Xu, Weiqing Xiao, Yuwei Guo, Chongjie Ye, Lvmin Zhang, Hao Zhao, Anyi Rao
cs.AI

Abstract

Recent progress has shown that video diffusion models (VDMs) can be repurposed for diverse multimodal graphics tasks. However, existing methods often train separate models for each problem setting, which fixes the input-output mapping and limits the modeling of correlations across modalities. We present UniVidX, a unified multimodal framework that leverages VDM priors for versatile video generation. UniVidX formulates pixel-aligned tasks as conditional generation in a shared multimodal space, adapts to modality-specific distributions while preserving the backbone's native priors, and promotes cross-modal consistency during synthesis. It is built on three key designs. Stochastic Condition Masking (SCM) randomly partitions modalities into clean conditions and noisy targets during training, enabling omni-directional conditional generation instead of fixed mappings. Decoupled Gated LoRA (DGL) introduces per-modality LoRAs that are activated only when their modality serves as the generation target, preserving the strong priors of the VDM. Cross-Modal Self-Attention (CMSA) shares keys and values across modalities while keeping modality-specific queries, facilitating information exchange and inter-modal alignment. We instantiate UniVidX in two domains: UniVid-Intrinsic, for RGB videos and intrinsic maps including albedo, irradiance, and normals; and UniVid-Alpha, for blended RGB videos and their constituent RGBA layers. Experiments show that both models achieve performance competitive with state-of-the-art methods across distinct tasks and generalize robustly to in-the-wild scenarios, even when trained on fewer than 1,000 videos. Project page: https://houyuanchen111.github.io/UniVidX.github.io/
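
The abstract gives only a high-level description of SCM, so the following PyTorch sketch is a guess at the mechanism rather than the authors' code: the function name `stochastic_condition_mask`, the `p_condition` probability, the diffusers-style `noise_scheduler.add_noise` call, and the latent layout are all illustrative assumptions.

```python
# Illustrative sketch of Stochastic Condition Masking (SCM), based only on the
# abstract: each training step randomly splits the modalities into clean
# conditions (kept noise-free, excluded from the loss) and noisy targets
# (diffused and supervised). All names and shapes here are assumptions.
import torch

def stochastic_condition_mask(latents, timestep, noise_scheduler, p_condition=0.5):
    """latents: dict mapping modality name -> latent tensor [B, C, T, H, W]."""
    modalities = list(latents.keys())
    # Randomly choose a condition/target split, keeping at least one target
    # so every step still has a denoising objective.
    is_condition = {m: torch.rand(()).item() < p_condition for m in modalities}
    if all(is_condition.values()):
        forced_target = modalities[torch.randint(len(modalities), ()).item()]
        is_condition[forced_target] = False

    noisy, loss_mask = {}, {}
    for m, z in latents.items():
        if is_condition[m]:
            noisy[m] = z          # clean condition: passed through untouched
            loss_mask[m] = False  # excluded from the diffusion loss
        else:
            eps = torch.randn_like(z)
            # diffusers-style scheduler API: add_noise(samples, noise, timesteps)
            noisy[m] = noise_scheduler.add_noise(z, eps, timestep)
            loss_mask[m] = True   # supervised as a denoising target
    return noisy, loss_mask, is_condition
```

Because the split is resampled at every step, one network sees every conditional direction (e.g. RGB to albedo and albedo to RGB) rather than a single fixed mapping, which is what the abstract means by omni-directional conditional generation.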
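DGL can be read as a frozen base projection with one LoRA branch per modality that only fires when that modality is currently a noisy generation target. A minimal sketch under that reading follows; the class name `DecoupledGatedLoRALinear`, the rank, and the gating interface are assumptions, not the paper's implementation.

```python
# Illustrative sketch of Decoupled Gated LoRA (DGL): each modality owns a LoRA
# branch on a frozen base linear layer, and the branch is gated on only when
# that modality serves as a generation target (never as a clean condition).
import torch
import torch.nn as nn

class DecoupledGatedLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, modalities, rank=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)        # keep the VDM prior frozen
        self.loras = nn.ModuleDict({
            m: nn.Sequential(
                nn.Linear(base.in_features, rank, bias=False),   # down-projection
                nn.Linear(rank, base.out_features, bias=False),  # up-projection
            )
            for m in modalities
        })
        for lora in self.loras.values():
            nn.init.zeros_(lora[1].weight)  # start as an identity to the base layer

    def forward(self, x, modality: str, is_target: bool):
        out = self.base(x)
        if is_target:                       # gate: LoRA fires only for targets
            out = out + self.loras[modality](x)
        return out
```

Gating the adapters off whenever a modality acts as a condition keeps conditioning paths running through the unmodified backbone, which is consistent with the claim that DGL preserves the VDM's strong priors.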
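CMSA's description ("shared keys and values, modality-specific queries") suggests attention in which each modality's own tokens form the queries while keys and values are pooled across every modality. Below is a single-head sketch; the shared projection matrices `w_q`, `w_k`, `w_v` and the flattened token layout are simplifying assumptions.

```python
# Illustrative sketch of Cross-Modal Self-Attention (CMSA): every modality
# attends with its own queries over keys and values concatenated from all
# modalities, so information flows across streams while each stream keeps
# its own point of view.
import torch
import torch.nn.functional as F

def cross_modal_self_attention(tokens, w_q, w_k, w_v):
    """tokens: dict modality -> [B, N, D]; w_q/w_k/w_v: [D, D] projections."""
    # Keys and values are shared: concatenate every modality's tokens.
    all_tokens = torch.cat(list(tokens.values()), dim=1)  # [B, N_total, D]
    k = all_tokens @ w_k
    v = all_tokens @ w_v
    out = {}
    for m, x in tokens.items():
        q = x @ w_q                                       # modality-specific queries
        out[m] = F.scaled_dot_product_attention(q, k, v)  # read from all modalities
    return out
```

Keeping queries per modality lets each stream read from the others without having its representation overwritten, which fits the inter-modal alignment the abstract attributes to CMSA.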