MAGREF: Masked Guidance for Any-Reference Video Generation
May 29, 2025
作者: Yufan Deng, Xun Guo, Yuanyang Yin, Jacob Zhiyuan Fang, Yiding Yang, Yizhi Wang, Shenghai Yuan, Angtian Wang, Bo Liu, Haibin Huang, Chongyang Ma
cs.AI
Abstract
Video generation has made substantial strides with the emergence of deep
generative models, especially diffusion-based approaches. However, video
generation based on multiple reference subjects still faces significant
challenges in maintaining multi-subject consistency and ensuring high
generation quality. In this paper, we propose MAGREF, a unified framework for
any-reference video generation that introduces masked guidance to enable
coherent multi-subject video synthesis conditioned on diverse reference images
and a textual prompt. Specifically, we propose (1) a region-aware dynamic
masking mechanism that enables a single model to flexibly handle inference over
various subjects, including humans, objects, and backgrounds, without
architectural changes, and (2) a pixel-wise channel concatenation mechanism
that operates on the channel dimension to better preserve appearance features.
Our model delivers state-of-the-art video generation quality, generalizing from
single-subject training to complex multi-subject scenarios with coherent
synthesis and precise control over individual subjects, outperforming existing
open-source and commercial baselines. To facilitate evaluation, we also
introduce a comprehensive multi-subject video benchmark. Extensive experiments
demonstrate the effectiveness of our approach, paving the way for scalable,
controllable, and high-fidelity multi-subject video synthesis. Code and model
can be found at: https://github.com/MAGREF-Video/MAGREF
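The pixel-wise channel concatenation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; it only assumes that each reference image is encoded to a latent aligned with the video latent's spatial grid, and that its region mask is appended as an extra channel so the backbone sees appearance features alongside the regions they condition (the function name and shapes are hypothetical):

```python
import numpy as np

def masked_channel_concat(video_latent, ref_latents, ref_masks):
    """Hypothetical sketch of masked guidance via channel concatenation:
    each reference latent is masked to its subject region, then stacked
    with its binary mask onto the video latent along the channel axis."""
    # video_latent: (C, H, W); each ref latent: (C, H, W); each mask: (1, H, W)
    parts = [video_latent]
    for lat, mask in zip(ref_latents, ref_masks):
        parts.append(lat * mask)  # zero out pixels outside the subject region
        parts.append(mask)        # mask channel marks where this reference applies
    return np.concatenate(parts, axis=0)

# Toy example: one 4-channel latent at 8x8 with a single reference subject.
C, H, W = 4, 8, 8
video = np.zeros((C, H, W))
ref = np.ones((C, H, W))
mask = np.zeros((1, H, W))
mask[:, :4, :] = 1.0  # subject occupies the top half of the frame
cond = masked_channel_concat(video, [ref], [mask])
print(cond.shape)  # (9, 8, 8): 4 video + 4 masked-reference + 1 mask channels
```

Because the conditioning lives in extra input channels rather than in new attention layers, the same backbone can, in principle, accept a varying number of references without architectural changes, which matches the "any-reference" framing of the abstract.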