Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models

Abstract

Recent advances in text-to-video diffusion models have enabled high-quality video synthesis, but controllable generation remains challenging, particularly under limited data and compute. Existing fine-tuning methods for conditional generation often rely on external encoders or architectural modifications, which demand large datasets and are typically restricted to spatially aligned conditioning, limiting flexibility and scalability. In this work, we introduce Temporal In-Context Fine-Tuning (TIC-FT), an efficient and versatile approach for adapting pretrained video diffusion models to diverse conditional generation tasks. Our key idea is to concatenate condition and target frames along the temporal axis and insert intermediate buffer frames with progressively increasing noise levels. These buffer frames enable smooth transitions, aligning the fine-tuning process with the pretrained model's temporal dynamics. TIC-FT requires no architectural changes and achieves strong performance with as few as 10 to 30 training samples. We validate our method across a range of tasks, including image-to-video and video-to-video generation, using large-scale base models such as CogVideoX-5B and Wan-14B. Extensive experiments show that TIC-FT outperforms existing baselines in both condition fidelity and visual quality, while remaining highly efficient in both training and inference. For additional results, visit https://kinam0252.github.io/TIC-FT/
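The frame layout described in the abstract (clean condition frames, then buffer frames with rising noise, then target frames) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function name `build_tic_sequence`, the linear buffer schedule, the choice to seed each buffer frame from the last condition frame, and the linear data-to-noise interpolation are all assumptions made for illustration only.

```python
import numpy as np

def build_tic_sequence(cond, target, num_buffer=3, rng=None):
    """Concatenate condition, buffer, and target frames along the time axis.

    cond, target: arrays of shape (T, C, H, W).
    Returns (sequence, noise_levels), where noise_levels[i] is in [0, 1]
    (0 = clean frame, 1 = pure noise).
    """
    rng = np.random.default_rng(0) if rng is None else rng

    # Assumption: buffer noise levels rise linearly, strictly between 0 and 1.
    buf_levels = np.linspace(0.0, 1.0, num_buffer + 2)[1:-1]

    # Assumption: every buffer frame starts as a copy of the last condition frame.
    buf_clean = np.repeat(cond[-1:], num_buffer, axis=0)

    # Condition frames stay clean; target frames start fully noised.
    levels = np.concatenate(
        [np.zeros(len(cond)), buf_levels, np.ones(len(target))]
    )
    clean = np.concatenate([cond, buf_clean, target], axis=0)

    # Assumption: noising is a linear interpolation between data and Gaussian noise.
    noise = rng.standard_normal(clean.shape)
    t = levels[:, None, None, None]
    sequence = (1.0 - t) * clean + t * noise
    return sequence, levels

# Hypothetical usage: 2 condition frames, 3 buffer frames, 5 target frames.
cond = np.ones((2, 3, 4, 4))
target = np.zeros((5, 3, 4, 4))
sequence, levels = build_tic_sequence(cond, target, num_buffer=3)
print(sequence.shape, levels)
```

In this sketch the per-frame noise level never decreases along the time axis, which is one way to picture the smooth transition from condition to target that the abstract attributes to the buffer frames.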

