MotionRAG: Motion Retrieval-Augmented Image-to-Video Generation
September 30, 2025
Authors: Chenhui Zhu, Yilu Wu, Shuai Wang, Gangshan Wu, Limin Wang
cs.AI
Abstract
Image-to-video generation has made remarkable progress with the advancements
in diffusion models, yet generating videos with realistic motion remains highly
challenging. This difficulty arises from the complexity of accurately modeling
motion, which involves capturing physical constraints, object interactions, and
domain-specific dynamics that are not easily generalized across diverse
scenarios. To address this, we propose MotionRAG, a retrieval-augmented
framework that enhances motion realism by adapting motion priors from relevant
reference videos through Context-Aware Motion Adaptation (CAMA). The key
technical innovations include: (i) a retrieval-based pipeline extracting
high-level motion features using a video encoder and specialized resamplers to
distill semantic motion representations; (ii) an in-context learning approach
for motion adaptation implemented through a causal transformer architecture;
(iii) an attention-based motion injection adapter that seamlessly integrates
transferred motion features into pretrained video diffusion models. Extensive
experiments demonstrate that our method achieves significant improvements
across multiple domains and various base models, all with negligible
computational overhead during inference. Furthermore, our modular design
enables zero-shot generalization to new domains by simply updating the
retrieval database without retraining any components. This research enhances
the core capability of video generation systems by enabling the effective
retrieval and transfer of motion priors, facilitating the synthesis of
realistic motion dynamics.
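
To make the three-stage design above concrete, the following is a minimal PyTorch sketch of how such a pipeline could be wired together. It is an illustration inferred from the abstract only, not the authors' released implementation: every class name (MotionResampler, CausalMotionAdapter, MotionInjectionAdapter, retrieve_topk), all shapes and hyperparameters, and design details such as the learned-query resampler, the causal self-attention mask, the zero-initialized gate, and cosine-similarity retrieval are assumptions.

```python
# Hypothetical sketch of a MotionRAG-style pipeline. All names, shapes,
# and design choices are assumptions based on the abstract, not the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotionResampler(nn.Module):
    """Distill per-frame video-encoder features into a fixed number of
    semantic motion tokens via learned queries and cross-attention."""

    def __init__(self, dim=768, num_tokens=16, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats):  # frame_feats: (B, T, dim)
        q = self.queries.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        tokens, _ = self.attn(q, frame_feats, frame_feats)
        return self.norm(tokens)  # (B, num_tokens, dim)


class CausalMotionAdapter(nn.Module):
    """In-context motion adaptation: retrieved reference motion tokens
    precede the target's tokens in one sequence, and a causal mask lets
    the target attend to the references but not vice versa."""

    def __init__(self, dim=768, depth=4, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            dim, num_heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, context):  # context: (B, L, dim)
        L = context.size(1)
        causal = torch.triu(
            torch.ones(L, L, dtype=torch.bool, device=context.device), 1)
        return self.blocks(context, mask=causal)


class MotionInjectionAdapter(nn.Module):
    """Attention-based injection: an extra cross-attention branch feeds
    adapted motion tokens into the frozen diffusion backbone's hidden
    states through a zero-initialized gate, so training starts from the
    unmodified base model."""

    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0: identity init

    def forward(self, hidden, motion_tokens):
        out, _ = self.attn(hidden, motion_tokens, motion_tokens)
        return hidden + torch.tanh(self.gate) * out


def retrieve_topk(query_emb, database_embs, k=2):
    """Cosine-similarity retrieval over a motion-feature database; swapping
    the database is what would enable zero-shot transfer to a new domain."""
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(database_embs, dim=-1).T
    return sims.topk(k, dim=-1).indices  # (B, k)


if __name__ == "__main__":
    B, T, dim, n_tok = 2, 16, 768, 16
    resampler = MotionResampler(dim, n_tok)
    adapter = CausalMotionAdapter(dim)
    injector = MotionInjectionAdapter(dim)

    ref_motion = resampler(torch.randn(B, T, dim))  # retrieved reference video
    tgt_motion = resampler(torch.randn(B, T, dim))  # target (input-image) side
    context = torch.cat([ref_motion, tgt_motion], dim=1)
    adapted = adapter(context)[:, -n_tok:]          # adapted target tokens

    hidden = torch.randn(B, 64, dim)                # diffusion hidden states
    print(injector(hidden, adapted).shape)          # torch.Size([2, 64, 768])
```

In a design like this, only the resampler, adapter, and injection layers would be trained while the video encoder and diffusion backbone stay frozen, which is consistent with the abstract's claims of negligible inference overhead and of domain transfer by updating the retrieval database alone.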