Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising
November 9, 2025
Authors: Assaf Singer, Noam Rotstein, Amir Mann, Ron Kimmel, Or Litany
cs.AI
Abstract
Diffusion-based video generation can create realistic videos, yet existing image- and text-based conditioning fails to offer precise motion control. Prior methods for motion-conditioned synthesis typically require model-specific fine-tuning, which is computationally expensive and restrictive. We introduce Time-to-Move (TTM), a training-free, plug-and-play framework for motion- and appearance-controlled video generation with image-to-video (I2V) diffusion models. Our key insight is to use crude reference animations obtained through user-friendly manipulations such as cut-and-drag or depth-based reprojection. Motivated by SDEdit's use of coarse layout cues for image editing, we treat the crude animations as coarse motion cues and adapt the mechanism to the video domain. We preserve appearance with image conditioning and introduce dual-clock denoising, a region-dependent strategy that enforces strong alignment in motion-specified regions while allowing flexibility elsewhere, balancing fidelity to user intent with natural dynamics. This lightweight modification of the sampling process incurs no additional training or runtime cost and is compatible with any backbone. Extensive experiments on object and camera motion benchmarks show that TTM matches or exceeds existing training-based baselines in realism and motion control. Beyond this, TTM introduces a unique capability: precise appearance control through pixel-level conditioning, exceeding the limits of text-only prompting. Visit our project page for video examples and code: https://time-to-move.github.io/.
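To make the dual-clock idea concrete, here is a minimal, self-contained sketch of how such region-dependent, SDEdit-style sampling could look, assuming a toy DDPM noise schedule and a placeholder denoise_step standing in for a real frozen I2V backbone. The names add_noise, dual_clock_sample, motion_mask, t_strong, and t_weak are illustrative assumptions, not the authors' released code; see the project page for the actual implementation.

```python
import torch

# Toy DDPM schedule (stand-in for the backbone's real noise schedule).
T_STEPS = 50
betas = torch.linspace(1e-4, 0.02, T_STEPS)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)  # cumulative product of (1 - beta_t)


def add_noise(x0, t):
    """Forward-noise a clean video x0 to diffusion step t (DDPM-style)."""
    a = alpha_bars[t]
    return a.sqrt() * x0 + (1.0 - a).sqrt() * torch.randn_like(x0)


def dual_clock_sample(denoise_step, ref_video, motion_mask,
                      t_start=40, t_strong=10, t_weak=25):
    """Hypothetical SDEdit-style sampling with a region-dependent ("dual-clock") schedule.

    denoise_step(x, t): one reverse step of a frozen I2V diffusion model (placeholder here).
    ref_video:   crude reference animation, e.g. shape (T, C, H, W), values in [-1, 1].
    motion_mask: 1.0 where the user specified motion, 0.0 elsewhere.
    t_start:     how much noise to add to the reference before denoising begins.
    t_strong:    motion-specified pixels keep being re-anchored to the noised
                 reference down to this step (the "slow" clock -> strong alignment).
    t_weak:      remaining pixels are released earlier, at this step
                 (the "fast" clock -> room for natural dynamics); t_weak > t_strong.
    """
    x = add_noise(ref_video, t_start)  # start from the noised crude animation
    for t in range(t_start, -1, -1):
        x = denoise_step(x, t)  # reverse-diffusion step with the frozen backbone
        # Per-pixel decision: keep following the noised reference or let the model take over.
        keep = motion_mask * float(t >= t_strong) + (1 - motion_mask) * float(t >= t_weak)
        x = keep * add_noise(ref_video, t) + (1 - keep) * x
    return x


if __name__ == "__main__":
    video = torch.rand(8, 3, 32, 32) * 2 - 1        # toy "crude animation"
    mask = torch.zeros_like(video)
    mask[:, :, 8:24, 8:24] = 1.0                    # region whose motion the user specified
    out = dual_clock_sample(lambda x, t: x, video, mask)  # identity denoiser, for illustration
    print(out.shape)
```

Because the modification only changes which pixels are re-anchored at each step, it adds no training and no extra sampling cost on top of the backbone's own denoising loop, which is what makes the approach plug-and-play.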