MAGREF: Masked Guidance for Any-Reference Video Generation

Abstract

Video generation has made substantial strides with the emergence of deep generative models, especially diffusion-based approaches. However, video generation conditioned on multiple reference subjects still struggles to maintain multi-subject consistency while ensuring high generation quality. In this paper, we propose MAGREF, a unified framework for any-reference video generation that introduces masked guidance to enable coherent multi-subject video synthesis conditioned on diverse reference images and a textual prompt. Specifically, we propose (1) a region-aware dynamic masking mechanism that enables a single model to flexibly handle inference over diverse subject types, including humans, objects, and backgrounds, without architectural changes, and (2) a pixel-wise channel concatenation mechanism that operates on the channel dimension to better preserve appearance features. Our model delivers state-of-the-art video generation quality, generalizing from single-subject training to complex multi-subject scenarios with coherent synthesis and precise control over individual subjects, and it outperforms existing open-source and commercial baselines. To facilitate evaluation, we also introduce a comprehensive multi-subject video benchmark. Extensive experiments demonstrate the effectiveness of our approach, paving the way for scalable, controllable, and high-fidelity multi-subject video synthesis. Code and model are available at: https://github.com/MAGREF-Video/MAGREF
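The abstract only names the two mechanisms, so the following is a minimal sketch of how region masking and channel-wise concatenation could fit together, not the MAGREF implementation. It assumes that reference images are pasted onto a blank canvas, that a single-channel binary mask marks the pasted regions, and that the noisy input, canvas, and mask are stacked along the channel axis. The function names (`blank`, `paste`, `build_condition`), the nested-list "tensors", and the placement coordinates are all hypothetical and chosen only for illustration; a real system would use a tensor library and latent-space inputs.

```python
# Illustrative sketch only: the canvas layout, mask format, and all names are
# assumptions made for this example, not details taken from the MAGREF code.

def blank(channels, height, width, value=0.0):
    """Create a channels x height x width nested-list 'tensor' filled with value."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]


def paste(canvas, mask, ref, top, left):
    """Copy a reference image onto the canvas and mark its region in the mask."""
    for c, plane in enumerate(ref):
        for y, row in enumerate(plane):
            for x, pixel in enumerate(row):
                canvas[c][top + y][left + x] = pixel
                mask[0][top + y][left + x] = 1.0


def build_condition(noisy, refs, placements):
    """Stack noisy input, reference canvas, and region mask along the channel axis."""
    channels, height, width = len(noisy), len(noisy[0]), len(noisy[0][0])
    canvas = blank(channels, height, width)
    mask = blank(1, height, width)
    # The number of references can vary per call; the mask adapts to whatever
    # regions are actually occupied, so the input layout itself never changes.
    for ref, (top, left) in zip(refs, placements):
        paste(canvas, mask, ref, top, left)
    # Channel-wise concatenation: every pixel keeps its spatial position.
    return noisy + canvas + mask


# Usage: two references (for example, a person and an object) on a 4 x 8 frame.
noisy = blank(3, 4, 8, value=0.5)
person = blank(3, 2, 2, value=1.0)
item = blank(3, 2, 3, value=2.0)
condition = build_condition(noisy, [person, item], [(0, 0), (2, 4)])
print(len(condition), "channels")
```

In this sketch, adding or removing a reference changes only the canvas and mask contents, never the number of input channels, which is one plausible reading of how a single model could handle different subject combinations without architectural changes.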
