MAGREF: Masked Guidance for Any-Reference Video Generation
May 29, 2025
Authors: Yufan Deng, Xun Guo, Yuanyang Yin, Jacob Zhiyuan Fang, Yiding Yang, Yizhi Wang, Shenghai Yuan, Angtian Wang, Bo Liu, Haibin Huang, Chongyang Ma
cs.AI
Abstract
Video generation has made substantial strides with the emergence of deep
generative models, especially diffusion-based approaches. However, video
generation based on multiple reference subjects still faces significant
challenges in maintaining multi-subject consistency and ensuring high
generation quality. In this paper, we propose MAGREF, a unified framework for
any-reference video generation that introduces masked guidance to enable
coherent multi-subject video synthesis conditioned on diverse reference images
and a textual prompt. Specifically, we propose (1) a region-aware dynamic
masking mechanism that enables a single model to flexibly handle inference over
diverse subjects, including humans, objects, and backgrounds, without
architectural changes, and (2) a pixel-wise channel concatenation mechanism
that operates on the channel dimension to better preserve appearance features.
Our model delivers state-of-the-art video generation quality, generalizing from
single-subject training to complex multi-subject scenarios with coherent
synthesis and precise control over individual subjects, outperforming existing
open-source and commercial baselines. To facilitate evaluation, we also
introduce a comprehensive multi-subject video benchmark. Extensive experiments
demonstrate the effectiveness of our approach, paving the way for scalable,
controllable, and high-fidelity multi-subject video synthesis. Code and model
can be found at: https://github.com/MAGREF-Video/MAGREF
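The two mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the tensor shapes, the compositing of references via region masks, and the function name are all assumptions made for illustration. The idea shown is that reference-image latents are placed into a spatial layout according to binary region masks and then concatenated with the noisy video latent along the channel dimension, so appearance features stay pixel-aligned with the frames being denoised.

```python
import torch

def masked_channel_concat(noisy_latent, ref_latents, region_masks):
    """Hypothetical sketch of mask-guided channel concatenation.

    noisy_latent : (B, C, F, H, W) noisy video latent
    ref_latents  : list of (B, C, H, W) reference-image latents
    region_masks : list of (B, 1, H, W) binary masks marking where
                   each reference subject should appear
    Returns a (B, 2*C + 1, F, H, W) conditioned latent.
    """
    B, C, F, H, W = noisy_latent.shape
    # Composite all reference latents into one spatial plane,
    # each pasted into the region its mask selects.
    ref_plane = torch.zeros(B, C, H, W)
    mask_plane = torch.zeros(B, 1, H, W)
    for ref, mask in zip(ref_latents, region_masks):
        ref_plane = ref_plane * (1 - mask) + ref * mask
        mask_plane = torch.clamp(mask_plane + mask, 0, 1)
    # Broadcast the static reference plane across all F frames and
    # concatenate along the channel dimension (dim=1), keeping the
    # reference features pixel-aligned with the video latent.
    ref_video = ref_plane.unsqueeze(2).expand(B, C, F, H, W)
    mask_video = mask_plane.unsqueeze(2).expand(B, 1, F, H, W)
    return torch.cat([noisy_latent, ref_video, mask_video], dim=1)
```

A denoising backbone would then consume the widened channel input; because the conditioning lives in extra channels rather than in new cross-attention modules, the same backbone could in principle handle one or several references without architectural changes, which is the property the abstract claims.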