V-Bridge: Bridging Video Generative Priors to Versatile Few-shot Image Restoration

Abstract

Large-scale video generative models are trained on vast and diverse visual data, which lets them internalize rich structural, semantic, and dynamic priors of the visual world. Although these models have demonstrated impressive generative capability, their potential as general-purpose visual learners remains largely untapped. In this work, we introduce V-Bridge, a framework that bridges this latent capacity to versatile few-shot image restoration tasks. We reinterpret image restoration not as a static regression problem but as a progressive generative process, and we leverage video models to simulate the gradual refinement from degraded inputs to high-fidelity outputs. Surprisingly, with only 1,000 multi-task training samples (less than 2% of the training data used by existing restoration methods), pretrained video models can be induced to perform competitive image restoration: a single model handles multiple tasks and rivals specialized architectures designed explicitly for this purpose. Our findings reveal that video generative models implicitly learn powerful and transferable restoration priors that can be activated with extremely limited data, challenging the traditional boundary between generative modeling and low-level vision and opening a new design paradigm for foundation models in visual tasks.
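
To make the "progressive generative process" framing concrete, the toy sketch below builds a short frame sequence that moves from a degraded image to its clean target, the kind of clip a pretrained video model could be fine-tuned to produce. The abstract does not say how V-Bridge constructs such trajectories; the linear blend, the function name `refinement_clip`, and the grayscale list-of-lists image format are all assumptions made purely for illustration.

```python
from typing import List

# Hypothetical image format: grayscale, rows of floats in [0, 1].
Image = List[List[float]]


def refinement_clip(degraded: Image, clean: Image, num_frames: int) -> List[Image]:
    """Return frames that blend linearly from `degraded` to `clean`.

    Assumption: a linear blend stands in for the "gradual refinement"
    described in the abstract; the actual trajectory may differ.
    """
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    if len(degraded) != len(clean) or any(
        len(d_row) != len(c_row) for d_row, c_row in zip(degraded, clean)
    ):
        raise ValueError("degraded and clean images must have the same shape")

    frames: List[Image] = []
    for t in range(num_frames):
        # alpha goes from 0.0 (first frame, degraded) to 1.0 (last frame, clean).
        alpha = t / (num_frames - 1)
        frame = [
            [(1.0 - alpha) * d + alpha * c for d, c in zip(d_row, c_row)]
            for d_row, c_row in zip(degraded, clean)
        ]
        frames.append(frame)
    return frames


if __name__ == "__main__":
    degraded = [[0.0, 1.0]]
    clean = [[1.0, 0.0]]
    for frame in refinement_clip(degraded, clean, 3):
        print(frame)
```

Under this framing, the first frame is the degraded input and the last frame is the restoration target, so a model trained on such clips treats restoration as video continuation rather than one-step regression.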
