MotionRAG: Motion Retrieval-Augmented Image-to-Video Generation

Abstract

Image-to-video generation has made remarkable progress with advances in diffusion models, yet generating videos with realistic motion remains highly challenging. The difficulty lies in modeling motion accurately, which requires capturing physical constraints, object interactions, and domain-specific dynamics that do not generalize easily across diverse scenarios. To address this, we propose MotionRAG, a retrieval-augmented framework that enhances motion realism by adapting motion priors from relevant reference videos through Context-Aware Motion Adaptation (CAMA). The key technical innovations are: (i) a retrieval-based pipeline that extracts high-level motion features using a video encoder and specialized resamplers to distill semantic motion representations; (ii) an in-context learning approach for motion adaptation, implemented through a causal transformer architecture; (iii) an attention-based motion injection adapter that seamlessly integrates the transferred motion features into pretrained video diffusion models. Extensive experiments demonstrate that our method achieves significant improvements across multiple domains and various base models, with negligible computational overhead during inference. Furthermore, our modular design enables zero-shot generalization to new domains by simply updating the retrieval database, without retraining any components. This research enhances the core capability of video generation systems by enabling the effective retrieval and transfer of motion priors, facilitating the synthesis of realistic motion dynamics.
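To make the retrieval step concrete, the following is a minimal sketch of how reference videos might be selected from a database. It is an illustration under stated assumptions, not the MotionRAG implementation: the abstract does not specify the similarity measure or the embedding format, so cosine similarity over fixed-length vectors is assumed here, and the names `retrieve_top_k`, `database`, and the toy embeddings are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query, database, k=2):
    """Return the ids of the k database entries most similar to the query.

    `database` maps a video id to its embedding. Both the mapping and the
    similarity measure are assumptions made for this sketch.
    """
    scored = [(cosine_similarity(query, emb), vid) for vid, emb in database.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [vid for _, vid in scored[:k]]


# Toy 3-dimensional embeddings standing in for encoder outputs (hypothetical values).
database = {
    "horse_gallop": [0.9, 0.1, 0.0],
    "waves_crashing": [0.0, 0.2, 0.9],
    "dog_running": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]

top = retrieve_top_k(query, database, k=2)
print(top)
```

In this sketch, covering a new domain only requires adding entries to `database`; nothing is retrained, which mirrors the zero-shot property described in the abstract.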
