

UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video Super-Resolution

October 9, 2025
Authors: Shian Du, Menghan Xia, Chang Liu, Quande Liu, Xintao Wang, Pengfei Wan, Xiangyang Ji
cs.AI

Abstract

Cascaded video super-resolution has emerged as a promising technique for decoupling the computational burden associated with generating high-resolution videos using large foundation models. Existing studies, however, are largely confined to text-to-video tasks and fail to leverage additional generative conditions beyond text, which are crucial for ensuring fidelity in multi-modal video generation. We address this limitation by presenting UniMMVSR, the first unified generative video super-resolution framework to incorporate hybrid-modal conditions, including text, images, and videos. We conduct a comprehensive exploration of condition injection strategies, training schemes, and data mixture techniques within a latent video diffusion model. A key challenge was designing distinct data construction and condition utilization methods to enable the model to precisely utilize all condition types, given their varied correlations with the target video. Our experiments demonstrate that UniMMVSR significantly outperforms existing methods, producing videos with superior detail and a higher degree of conformity to multi-modal conditions. We also validate the feasibility of combining UniMMVSR with a base model to achieve multi-modal guided generation of 4K video, a feat previously unattainable with existing techniques.
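The cascaded idea described above, where a base model first generates a low-resolution video and a separate super-resolution model then upscales it while consuming the same multi-modal conditions, can be illustrated with a minimal sketch. Everything below is hypothetical scaffolding: the class and function names are invented for illustration, and simple stubs (nearest-neighbour upsampling, zero-filled frames) stand in for the paper's actual latent video diffusion models.

```python
# Hypothetical sketch of a cascaded, multi-modal-conditioned SR pipeline.
# Stubs stand in for the actual diffusion models; only the two-stage
# structure and shared condition object reflect the paper's description.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Conditions:
    """Hybrid-modal conditions shared by both cascade stages."""
    text: str
    image: Optional[list] = None   # reference image (e.g. image-to-video)
    video: Optional[list] = None   # reference video (e.g. video-to-video)

def base_model(cond: Conditions, frames: int = 4, size: int = 8) -> List[list]:
    """Stage 1 stub: 'generate' a low-resolution video from the conditions."""
    return [[[0.0] * size for _ in range(size)] for _ in range(frames)]

def sr_model(lr_video: List[list], cond: Conditions, scale: int = 4) -> List[list]:
    """Stage 2 stub: upscale the low-res video; a real model would inject
    every available condition here. Nearest-neighbour upsampling only."""
    hr = []
    for frame in lr_video:
        up = []
        for row in frame:
            wide = [v for v in row for _ in range(scale)]  # widen each row
            up.extend([wide] * scale)                      # repeat rows
        hr.append(up)
    return hr

def cascaded_generate(cond: Conditions, scale: int = 4) -> List[list]:
    """Run the cascade: cheap low-res generation, then conditioned SR."""
    lr = base_model(cond)
    return sr_model(lr, cond, scale)

video = cascaded_generate(Conditions(text="a red fox running"))
print(len(video), len(video[0]), len(video[0][0]))  # 4 frames at 32x32
```

The point of the sketch is the decoupling the abstract highlights: the expensive high-resolution work happens only in stage 2, while both stages see the same `Conditions` object so that fidelity to text, image, and video references can be enforced at full resolution.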