UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video Super-Resolution

Abstract

Cascaded video super-resolution has emerged as a promising technique for decoupling the computational burden associated with generating high-resolution videos using large foundation models. Existing studies, however, are largely confined to text-to-video tasks and fail to leverage additional generative conditions beyond text, which are crucial for ensuring fidelity in multi-modal video generation. We address this limitation by presenting UniMMVSR, the first unified generative video super-resolution framework to incorporate hybrid-modal conditions, including text, images, and videos. We conduct a comprehensive exploration of condition injection strategies, training schemes, and data mixture techniques within a latent video diffusion model. A key challenge is designing distinct data construction and condition utilization methods that enable the model to precisely utilize all condition types, given their varied correlations with the target video. Our experiments demonstrate that UniMMVSR significantly outperforms existing methods, producing videos with superior detail and a higher degree of conformity to multi-modal conditions. We also validate the feasibility of combining UniMMVSR with a base model to achieve multi-modal guided generation of 4K video, a feat previously unattainable with existing techniques.