MAGREF: Masked Guidance for Any-Reference Video Generation
May 29, 2025
Authors: Yufan Deng, Xun Guo, Yuanyang Yin, Jacob Zhiyuan Fang, Yiding Yang, Yizhi Wang, Shenghai Yuan, Angtian Wang, Bo Liu, Haibin Huang, Chongyang Ma
cs.AI
Abstract
Video generation has made substantial strides with the emergence of deep
generative models, especially diffusion-based approaches. However, video
generation based on multiple reference subjects still faces significant
challenges in maintaining multi-subject consistency and ensuring high
generation quality. In this paper, we propose MAGREF, a unified framework for
any-reference video generation that introduces masked guidance to enable
coherent multi-subject video synthesis conditioned on diverse reference images
and a textual prompt. Specifically, we propose (1) a region-aware dynamic
masking mechanism that enables a single model to flexibly handle inference over
diverse subjects, including humans, objects, and backgrounds, without
architectural changes, and (2) a pixel-wise channel concatenation mechanism
that operates on the channel dimension to better preserve appearance features.
Our model delivers state-of-the-art video generation quality, generalizing from
single-subject training to complex multi-subject scenarios with coherent
synthesis and precise control over individual subjects, outperforming existing
open-source and commercial baselines. To facilitate evaluation, we also
introduce a comprehensive multi-subject video benchmark. Extensive experiments
demonstrate the effectiveness of our approach, paving the way for scalable,
controllable, and high-fidelity multi-subject video synthesis. Code and model
can be found at: https://github.com/MAGREF-Video/MAGREF
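The two mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the tensor shapes, the compositing of references via region masks, and the function name are all assumptions made for illustration. The idea shown is that reference-image latents are placed into a spatial layout according to binary region masks and then concatenated with the noisy video latent along the channel dimension, so appearance features stay pixel-aligned with the frames being denoised.

```python
import torch

def masked_channel_concat(noisy_latent, ref_latents, region_masks):
    """Hypothetical sketch of mask-guided channel concatenation.

    noisy_latent : (B, C, F, H, W) noisy video latent
    ref_latents  : list of (B, C, H, W) reference-image latents
    region_masks : list of (B, 1, H, W) binary masks marking where
                   each reference subject should appear
    Returns a (B, 2*C + 1, F, H, W) conditioned latent.
    """
    B, C, F, H, W = noisy_latent.shape
    # Composite all reference latents into one spatial plane,
    # each pasted into the region its mask selects.
    ref_plane = torch.zeros(B, C, H, W)
    mask_plane = torch.zeros(B, 1, H, W)
    for ref, mask in zip(ref_latents, region_masks):
        ref_plane = ref_plane * (1 - mask) + ref * mask
        mask_plane = torch.clamp(mask_plane + mask, 0, 1)
    # Broadcast the static reference plane across all F frames and
    # concatenate along the channel dimension (dim=1), keeping the
    # reference features pixel-aligned with the video latent.
    ref_video = ref_plane.unsqueeze(2).expand(B, C, F, H, W)
    mask_video = mask_plane.unsqueeze(2).expand(B, 1, F, H, W)
    return torch.cat([noisy_latent, ref_video, mask_video], dim=1)
```

A denoising backbone would then consume the widened channel input; because the conditioning lives in extra channels rather than in new cross-attention modules, the same backbone could in principle handle one or several references without architectural changes, which is the property the abstract claims.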