UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video Super-Resolution
October 9, 2025
Authors: Shian Du, Menghan Xia, Chang Liu, Quande Liu, Xintao Wang, Pengfei Wan, Xiangyang Ji
cs.AI
Abstract
Cascaded video super-resolution has emerged as a promising technique for
decoupling the computational burden associated with generating high-resolution
videos using large foundation models. Existing studies, however, are largely
confined to text-to-video tasks and fail to leverage additional generative
conditions beyond text, which are crucial for ensuring fidelity in multi-modal
video generation. We address this limitation by presenting UniMMVSR, the first
unified generative video super-resolution framework to incorporate hybrid-modal
conditions, including text, images, and videos. We conduct a comprehensive
exploration of condition injection strategies, training schemes, and data
mixture techniques within a latent video diffusion model. A key challenge was
designing distinct data construction and condition utilization methods that
enable the model to exploit each condition type precisely, given their varied
correlations with the target video. Our experiments demonstrate that UniMMVSR
significantly outperforms existing methods, producing videos with superior
detail and a higher degree of conformity to multi-modal conditions. We also
validate the feasibility of combining UniMMVSR with a base model to achieve
multi-modal guided generation of 4K video, a feat previously unattainable with
existing techniques.
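
To make the cascaded setup concrete, below is a minimal PyTorch sketch of the general pattern the abstract describes: a base model's low-resolution latent is spatially upsampled and concatenated channel-wise with the noisy high-resolution latent, while a fused text/image/video embedding biases the denoiser. All class names, shapes, the fusion of conditions into a single embedding, and the crude sampling update are illustrative assumptions for exposition; the paper's actual architecture, scheduler, and condition-injection strategies are not specified in this abstract.

```python
# Hypothetical sketch of one cascaded, multi-modal-conditioned video
# super-resolution stage in latent space. Interfaces are assumptions,
# not the UniMMVSR API.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CondSRDenoiser(nn.Module):
    """Toy denoiser: predicts noise from the noisy HR latent concatenated
    channel-wise with the upsampled LR latent; a fused text/image/video
    embedding and the timestep enter as a per-sample feature bias."""

    def __init__(self, latent_ch=4, cond_dim=64, hidden=32):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, hidden)
        self.time_proj = nn.Linear(1, hidden)
        self.in_conv = nn.Conv3d(2 * latent_ch, hidden, 3, padding=1)
        self.out_conv = nn.Conv3d(hidden, latent_ch, 3, padding=1)

    def forward(self, z_noisy, z_lr_up, t, cond_emb):
        # Condition injection: concatenate latents, then add a bias built
        # from the hybrid-modal embedding and the diffusion timestep.
        h = self.in_conv(torch.cat([z_noisy, z_lr_up], dim=1))
        bias = self.cond_proj(cond_emb) + self.time_proj(t[:, None].float())
        h = F.silu(h + bias[:, :, None, None, None])
        return self.out_conv(h)


@torch.no_grad()
def sr_sample(denoiser, z_lr, cond_emb, steps=4, scale=2):
    """Very simplified sampling loop: upsample the LR latent spatially,
    then iteratively denoise a Gaussian latent at the HR resolution."""
    b, c, f, h, w = z_lr.shape
    z_lr_up = F.interpolate(z_lr, size=(f, h * scale, w * scale),
                            mode="trilinear")
    z = torch.randn_like(z_lr_up)
    for i in reversed(range(steps)):
        t = torch.full((b,), i, dtype=torch.long)
        eps = denoiser(z, z_lr_up, t, cond_emb)
        z = z - eps / steps  # crude update standing in for a real scheduler
    return z


if __name__ == "__main__":
    denoiser = CondSRDenoiser()
    z_lr = torch.randn(1, 4, 8, 16, 16)  # low-res latent from a base model
    cond = torch.randn(1, 64)            # fused text/image/video embedding
    z_hr = sr_sample(denoiser, z_lr, cond)
    print(z_hr.shape)  # torch.Size([1, 4, 8, 32, 32])
```

In a full cascade of this kind, the base model would first generate the low-resolution video from the multi-modal prompt, and a stage like the one sketched above would then upsample it (potentially to 4K) while re-consuming the same conditions to preserve fidelity to them.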