SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation

Abstract

High-resolution image-to-video (I2V) generation aims to synthesize realistic temporal dynamics while preserving the fine-grained appearance details of the input image. At 2K resolution the task becomes extremely challenging, and existing solutions have the following weaknesses: 1) end-to-end models are often prohibitively expensive in memory and latency; 2) pipelines that cascade low-resolution generation with a generic video super-resolution model tend to hallucinate details and drift from input-specific local structures, because the super-resolution stage is not explicitly conditioned on the input image. To address this, we propose SwiftI2V, an efficient framework tailored for high-resolution I2V. Following the widely used two-stage design, it addresses the efficiency-fidelity dilemma by first generating a low-resolution motion reference, which reduces token costs and eases the modeling burden, and then performing a strongly image-conditioned 2K synthesis guided by that motion to recover input-faithful details with controlled overhead. Specifically, to make generation more scalable, SwiftI2V introduces Conditional Segment-wise Generation (CSG), which synthesizes videos segment by segment under a bounded per-step token budget, and adopts bidirectional contextual interaction within each segment to improve cross-segment coherence and input fidelity. On VBench-I2V at 2K resolution, SwiftI2V achieves performance comparable to end-to-end baselines while reducing total GPU time by 202x. In particular, it enables practical 2K I2V generation on a single datacenter GPU (e.g., H800) or consumer GPU (e.g., RTX 4090).
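To make the "bounded per-step token budget" idea concrete, the following is a minimal back-of-the-envelope sketch, not the authors' implementation. It only counts tokens. The resolution taken for "2K" (2560x1440), the compression strides, the frame counts, and all function names are hypothetical assumptions chosen for illustration, and the count ignores any conditioning tokens (input image, motion reference) that a real model would add.

```python
from math import ceil


def latent_tokens(width, height, frames, spatial_stride=16, temporal_stride=4):
    """Token count for one clip under assumed (hypothetical) compression factors.

    spatial_stride: combined per-side downsampling of autoencoder and patchify step.
    temporal_stride: number of video frames merged into one latent frame.
    """
    latent_frames = ceil(frames / temporal_stride)
    return latent_frames * ceil(width / spatial_stride) * ceil(height / spatial_stride)


def segment_plan(total_frames, segment_frames):
    """Split the frame range into consecutive segments of at most segment_frames."""
    return [
        (start, min(start + segment_frames, total_frames))
        for start in range(0, total_frames, segment_frames)
    ]


def peak_tokens_per_step(width, height, total_frames, segment_frames, **strides):
    """Largest token count any single segment-wise step has to process."""
    return max(
        latent_tokens(width, height, end - start, **strides)
        for start, end in segment_plan(total_frames, segment_frames)
    )


if __name__ == "__main__":
    w, h = 2560, 1440  # assumed meaning of "2K"; the abstract does not specify it
    for total in (80, 160):
        full = latent_tokens(w, h, total)
        peak = peak_tokens_per_step(w, h, total, segment_frames=16)
        print(f"{total} frames: all-at-once={full} tokens, per-segment peak={peak} tokens")
```

Under these assumed numbers, the all-at-once token count grows with video length, while the per-segment peak depends only on the segment length. That is the property the abstract attributes to segment-by-segment synthesis. How segments are conditioned on one another is not described in the abstract and is not modeled here.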
