MAGREF: Masked Guidance for Any-Reference Video Generation
May 29, 2025
作者: Yufan Deng, Xun Guo, Yuanyang Yin, Jacob Zhiyuan Fang, Yiding Yang, Yizhi Wang, Shenghai Yuan, Angtian Wang, Bo Liu, Haibin Huang, Chongyang Ma
cs.AI
Abstract
Video generation has made substantial strides with the emergence of deep
generative models, especially diffusion-based approaches. However, video
generation based on multiple reference subjects still faces significant
challenges in maintaining multi-subject consistency and ensuring high
generation quality. In this paper, we propose MAGREF, a unified framework for
any-reference video generation that introduces masked guidance to enable
coherent multi-subject video synthesis conditioned on diverse reference images
and a textual prompt. Specifically, we propose (1) a region-aware dynamic
masking mechanism that enables a single model to flexibly handle inference over
various subjects, including humans, objects, and backgrounds, without
architectural changes, and (2) a pixel-wise channel concatenation mechanism
that operates on the channel dimension to better preserve appearance features.
Our model delivers state-of-the-art video generation quality, generalizing from
single-subject training to complex multi-subject scenarios with coherent
synthesis and precise control over individual subjects, outperforming existing
open-source and commercial baselines. To facilitate evaluation, we also
introduce a comprehensive multi-subject video benchmark. Extensive experiments
demonstrate the effectiveness of our approach, paving the way for scalable,
controllable, and high-fidelity multi-subject video synthesis. Code and model
can be found at: https://github.com/MAGREF-Video/MAGREF
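The pixel-wise channel concatenation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; it only assumes that each reference image is encoded to a latent aligned with the video latent's spatial grid, and that its region mask is appended as an extra channel so the backbone sees appearance features alongside the regions they condition (the function name and shapes are hypothetical):

```python
import numpy as np

def masked_channel_concat(video_latent, ref_latents, ref_masks):
    """Hypothetical sketch of masked guidance via channel concatenation:
    each reference latent is masked to its subject region, then stacked
    with its binary mask onto the video latent along the channel axis."""
    # video_latent: (C, H, W); each ref latent: (C, H, W); each mask: (1, H, W)
    parts = [video_latent]
    for lat, mask in zip(ref_latents, ref_masks):
        parts.append(lat * mask)  # zero out pixels outside the subject region
        parts.append(mask)        # mask channel marks where this reference applies
    return np.concatenate(parts, axis=0)

# Toy example: one 4-channel latent at 8x8 with a single reference subject.
C, H, W = 4, 8, 8
video = np.zeros((C, H, W))
ref = np.ones((C, H, W))
mask = np.zeros((1, H, W))
mask[:, :4, :] = 1.0  # subject occupies the top half of the frame
cond = masked_channel_concat(video, [ref], [mask])
print(cond.shape)  # (9, 8, 8): 4 video + 4 masked-reference + 1 mask channels
```

Because the conditioning lives in extra input channels rather than in new attention layers, the same backbone can, in principle, accept a varying number of references without architectural changes, which matches the "any-reference" framing of the abstract.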