NaRCan: Natural Refined Canonical Image with Integration of Diffusion Prior for Video Editing

Abstract

We propose NaRCan, a video editing framework that integrates a hybrid deformation field with a diffusion prior to generate high-quality, natural canonical images that represent the input video. Our approach uses a homography to model global motion and multi-layer perceptrons (MLPs) to capture local residual deformations, which improves the model's ability to handle complex video dynamics. By introducing a diffusion prior from the early stages of training, our model ensures that the generated images retain a high-quality, natural appearance, making the resulting canonical images suitable for various downstream video editing tasks, a capability not achieved by current canonical-based methods. Furthermore, we incorporate low-rank adaptation (LoRA) fine-tuning and introduce a noise and diffusion prior update scheduling technique that accelerates the training process by 14 times. Extensive experimental results show that our method outperforms existing approaches on various video editing tasks and produces coherent, high-quality edited video sequences. See our project page for video results at https://koi953215.github.io/NaRCan_page/.
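The hybrid deformation field described in the abstract (a homography for global motion plus an MLP for local residual deformation) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class `ResidualMLP`, the function names, the `(x, y, t)` input to the MLP, the network size, and the frame-to-canonical mapping direction are all assumptions made for illustration, and the weights are random rather than trained.

```python
import numpy as np

def apply_homography(H, xy):
    """Warp N x 2 points with a 3 x 3 homography (global motion term)."""
    ones = np.ones((xy.shape[0], 1))
    homog = np.hstack([xy, ones]) @ H.T      # N x 3 homogeneous coordinates
    return homog[:, :2] / homog[:, 2:3]      # perspective divide

class ResidualMLP:
    """Hypothetical two-layer MLP mapping (x, y, t) to a small 2-D offset."""
    def __init__(self, hidden=32, scale=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 1.0, (3, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, scale, (hidden, 2))  # small init: near-zero residual
        self.b2 = np.zeros(2)

    def __call__(self, xyt):
        h = np.tanh(xyt @ self.w1 + self.b1)
        return h @ self.w2 + self.b2

def hybrid_deform(H, mlp, xy, t):
    """Assumed mapping from frame coordinates at time t to canonical coordinates:
    homography-warped position plus the MLP's local residual offset."""
    global_xy = apply_homography(H, xy)
    t_col = np.full((xy.shape[0], 1), t)
    residual = mlp(np.hstack([xy, t_col]))
    return global_xy + residual

# Usage: a pure horizontal translation as the per-frame homography.
H = np.eye(3)
H[0, 2] = 0.5
points = np.array([[0.0, 0.0], [1.0, 2.0]])
canonical_xy = hybrid_deform(H, ResidualMLP(), points, t=0.3)
```

Splitting the deformation this way lets the homography absorb most of the camera motion with only eight degrees of freedom, so the MLP only has to represent the remaining local offsets.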
