UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video Super-Resolution

Abstract

Cascaded video super-resolution has emerged as a promising technique for reducing the computational burden of generating high-resolution videos with large foundation models. Existing studies, however, are largely confined to text-to-video tasks and fail to leverage generative conditions beyond text, which are crucial for ensuring fidelity in multi-modal video generation. We address this limitation by presenting UniMMVSR, the first unified generative video super-resolution framework to incorporate hybrid-modal conditions, including text, images, and videos. We conduct a comprehensive exploration of condition injection strategies, training schemes, and data mixture techniques within a latent video diffusion model. A key challenge is designing distinct data construction and condition utilization methods that enable the model to precisely utilize all condition types, given their varied correlations with the target video. Our experiments demonstrate that UniMMVSR significantly outperforms existing methods, producing videos with superior detail and a higher degree of conformity to multi-modal conditions. We also validate the feasibility of combining UniMMVSR with a base model to achieve multi-modal guided generation of 4K video, which was previously unattainable with existing techniques.

