MotionRAG: Motion Retrieval-Augmented Image-to-Video Generation
September 30, 2025
Authors: Chenhui Zhu, Yilu Wu, Shuai Wang, Gangshan Wu, Limin Wang
cs.AI
Abstract
Image-to-video generation has made remarkable progress with the advancements
in diffusion models, yet generating videos with realistic motion remains highly
challenging. This difficulty arises from the complexity of accurately modeling
motion, which involves capturing physical constraints, object interactions, and
domain-specific dynamics that are not easily generalized across diverse
scenarios. To address this, we propose MotionRAG, a retrieval-augmented
framework that enhances motion realism by adapting motion priors from relevant
reference videos through Context-Aware Motion Adaptation (CAMA). The key
technical innovations include: (i) a retrieval-based pipeline that extracts
high-level motion features using a video encoder and specialized resamplers to
distill semantic motion representations; (ii) an in-context learning approach
to motion adaptation, implemented through a causal transformer architecture;
and (iii) an attention-based motion injection adapter that seamlessly
integrates the transferred motion features into pretrained video diffusion
models (illustrative sketches of each component follow the abstract). Extensive
experiments demonstrate that our method achieves significant improvements
across multiple domains and various base models, all with negligible
computational overhead during inference. Furthermore, our modular design
enables zero-shot generalization to new domains by simply updating the
retrieval database without retraining any components. This research enhances
the core capability of video generation systems by enabling the effective
retrieval and transfer of motion priors, facilitating the synthesis of
realistic motion dynamics.
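
The abstract gives no implementation details, so the following is a minimal
PyTorch sketch of what component (i), the retrieval pipeline, could look like.
All names (`MotionResampler`, `retrieve_references`), dimensions, and the
learned-query resampling choice are assumptions for illustration, not the
authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionResampler(nn.Module):
    """Hypothetical resampler: compresses per-frame video-encoder features
    into a fixed set of motion tokens via learned queries and cross-attention."""
    def __init__(self, feat_dim=768, num_tokens=16, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, feat_dim) * 0.02)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, frame_feats):
        # frame_feats: (B, T*N, D) flattened spatio-temporal encoder features
        q = self.queries.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        tokens, _ = self.attn(q, frame_feats, frame_feats)
        return self.norm(tokens)  # (B, num_tokens, D) semantic motion tokens

@torch.no_grad()
def retrieve_references(query_tokens, db_tokens, k=3):
    """Cosine nearest-neighbour search over mean-pooled motion tokens."""
    q = F.normalize(query_tokens.mean(dim=1), dim=-1)   # (B, D)
    db = F.normalize(db_tokens.mean(dim=1), dim=-1)     # (M, D)
    return (q @ db.t()).topk(k, dim=-1).indices         # (B, k) reference ids

# Usage with dummy tensors: 8 frames x 196 patch tokens per video
feats = torch.randn(2, 8 * 196, 768)
tokens = MotionResampler()(feats)                # (2, 16, 768)
database = torch.randn(100, 16, 768)             # precomputed motion tokens
ref_ids = retrieve_references(tokens, database)  # indices into the database
```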
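
Component (ii), CAMA, is described only as in-context learning with a causal
transformer. One plausible reading, sketched below under assumed shapes and
the hypothetical name `CausalMotionAdapter`: retrieved reference motion tokens
are placed before the target's tokens in a single sequence, and a causal mask
lets the target positions attend to the references while the adapted tokens
are read off the target slots.

```python
import torch
import torch.nn as nn

class CausalMotionAdapter(nn.Module):
    """Hypothetical CAMA module: a causal transformer over the concatenation
    [reference motion tokens ; target tokens], returning the target slots."""
    def __init__(self, dim=768, depth=4, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            dim, num_heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, ref_tokens, target_tokens):
        # ref_tokens: (B, k*n, D) from retrieved videos; target_tokens: (B, n, D)
        seq = torch.cat([ref_tokens, target_tokens], dim=1)
        L = seq.size(1)
        causal = torch.triu(  # upper-triangular -inf mask blocks future positions
            torch.full((L, L), float('-inf'), device=seq.device), diagonal=1)
        out = self.blocks(seq, mask=causal)
        return out[:, -target_tokens.size(1):]  # adapted target motion tokens

# Usage: adapt 16 target tokens given 3 retrieved references of 16 tokens each
adapter = CausalMotionAdapter()
adapted = adapter(torch.randn(2, 3 * 16, 768), torch.randn(2, 16, 768))
```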
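
Component (iii) injects the adapted motion features into a pretrained video
diffusion model through attention. A common pattern for such adapters, assumed
here rather than taken from the paper, is a residual cross-attention branch
with a zero-initialized output projection so that training starts from the
unmodified base model; `MotionInjectionAdapter` and all dimensions are
illustrative.

```python
import torch
import torch.nn as nn

class MotionInjectionAdapter(nn.Module):
    """Hypothetical injection adapter: diffusion hidden states cross-attend
    to motion tokens; the zero-initialized output projection makes the branch
    an identity mapping at the start of training."""
    def __init__(self, hidden_dim=1024, motion_dim=768, num_heads=8):
        super().__init__()
        self.proj_kv = nn.Linear(motion_dim, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.proj_out = nn.Linear(hidden_dim, hidden_dim)
        nn.init.zeros_(self.proj_out.weight)
        nn.init.zeros_(self.proj_out.bias)

    def forward(self, hidden_states, motion_tokens):
        # hidden_states: (B, L, hidden_dim) from a diffusion model block
        kv = self.proj_kv(motion_tokens)                 # (B, n, hidden_dim)
        attn_out, _ = self.attn(hidden_states, kv, kv)
        return hidden_states + self.proj_out(attn_out)   # residual injection

# Usage: inject 16 adapted motion tokens into one block's hidden states
adapter = MotionInjectionAdapter()
h = adapter(torch.randn(2, 1024, 1024), torch.randn(2, 16, 768))
```

Keeping the injection in a separate residual branch is also what would allow
the retrieval database to be swapped for a new domain without retraining the
frozen base model, consistent with the zero-shot generalization claim above.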