
**Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising**

November 9, 2025
作者: Assaf Singer, Noam Rotstein, Amir Mann, Ron Kimmel, Or Litany
cs.AI

Abstract

Diffusion-based video generation can create realistic videos, yet existing image- and text-based conditioning fails to offer precise motion control. Prior methods for motion-conditioned synthesis typically require model-specific fine-tuning, which is computationally expensive and restrictive. We introduce Time-to-Move (TTM), a training-free, plug-and-play framework for motion- and appearance-controlled video generation with image-to-video (I2V) diffusion models. Our key insight is to use crude reference animations obtained through user-friendly manipulations such as cut-and-drag or depth-based reprojection. Motivated by SDEdit's use of coarse layout cues for image editing, we treat the crude animations as coarse motion cues and adapt the mechanism to the video domain. We preserve appearance with image conditioning and introduce dual-clock denoising, a region-dependent strategy that enforces strong alignment in motion-specified regions while allowing flexibility elsewhere, balancing fidelity to user intent with natural dynamics. This lightweight modification of the sampling process incurs no additional training or runtime cost and is compatible with any backbone. Extensive experiments on object and camera motion benchmarks show that TTM matches or exceeds existing training-based baselines in realism and motion control. Beyond this, TTM introduces a unique capability: precise appearance control through pixel-level conditioning, exceeding the limits of text-only prompting. Visit our project page for video examples and code: https://time-to-move.github.io/.
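The dual-clock idea can be pictured as a region-dependent SDEdit schedule: the crude reference animation is re-injected (at the matching noise level) into the sample for part of the denoising trajectory, and the motion-specified region stays anchored to it longer than the rest of the frame. The sketch below is a minimal illustration of that idea only, not the authors' implementation: `dual_clock_sample`, `denoise_step`, the toy linear `add_noise` schedule, the binary `motion_mask` convention, and the two cutoff levels `t_strong`/`t_weak` are all illustrative assumptions standing in for an actual I2V diffusion backbone and its sampler.

```python
# Minimal sketch of dual-clock denoising as described in the abstract.
# All names and the noise schedule are hypothetical stand-ins, not the paper's code.
import torch

def add_noise(x0, t):
    """Toy forward diffusion: blend x0 toward Gaussian noise at level t in [0, 1]."""
    return (1.0 - t) * x0 + t * torch.randn_like(x0)

def dual_clock_sample(denoise_step, crude_video, motion_mask,
                      t_strong=0.4, t_weak=0.7, num_steps=50):
    """
    denoise_step : callable (x, t_cur, t_next) -> x, one step of an I2V backbone sampler
    crude_video  : (F, C, H, W) crude reference animation (e.g. a cut-and-drag result)
    motion_mask  : (F, 1, H, W), 1 where the user specified motion, 0 elsewhere
    t_strong     : noise level at which motion regions stop being re-anchored (later clock)
    t_weak       : noise level at which unconstrained regions are released (earlier clock)
    """
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    x = torch.randn_like(crude_video)              # start from pure noise
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x = denoise_step(x, t_cur, t_next)         # one backbone denoising step
        # Re-anchor each region to the noised crude animation while its own
        # clock is still "early"; motion regions follow the reference longer.
        anchor = add_noise(crude_video, t_next)
        keep_strong = (motion_mask > 0.5) & (t_next > t_strong)
        keep_weak = (motion_mask <= 0.5) & (t_next > t_weak)
        keep = (keep_strong | keep_weak).float()
        x = keep * anchor + (1.0 - keep) * x
    return x
```

Under these assumptions, the motion-specified region is anchored to the crude animation over a longer portion of the trajectory (its clock releases at the lower level `t_strong`), enforcing strong alignment with the user's intent, while the rest of the frame is released early (`t_weak`) and regenerated freely, giving the natural dynamics the abstract describes.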