MotiMotion: Motion-Controlled Video Generation with Visual Reasoning

May 21, 2026
Authors: Lee Hsin-Ying, Hanwen Jiang, Yiqun Mei, Jing Shi, Ming-Hsuan Yang, Zhixin Shu
cs.AI

Abstract

Current motion-controlled image-to-video generation models rigidly follow user-provided trajectories that are often sparse, imprecise, and causally incomplete. Such reliance often yields unnatural or implausible outcomes, especially by missing secondary causal consequences. To address this, we introduce MotiMotion, a novel framework that reformulates motion control as a reasoning-then-generation problem. To encourage causally grounded and commonsense-consistent interactions, we leverage a training-free vision-language reasoner to refine image-space coordinates of primary trajectories and to hallucinate plausible secondary motions. To further improve motion naturalness, we propose a confidence-aware control scheme that modulates guidance strength, enabling the model to closely follow high-confidence plans while correcting artifacts under low-confidence inputs with its internal generative priors. To support systematic evaluation, we curate a new image-to-video benchmark, MotiBench, consisting of interaction-centric scenes where new events are triggered by motion. Both VLM-based evaluation and a human study on MotiBench demonstrate that MotiMotion produces videos with more plausible object behaviors and interaction, and is preferred over existing approaches.
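
The abstract does not specify how the confidence-aware control scheme modulates guidance strength. As a rough illustration only, the sketch below assumes a linear mapping from a planner confidence in [0, 1] to a guidance weight, and an element-wise blend between an unguided prediction and a trajectory-guided one. The function names, the `w_min`/`w_max` bounds, and the blending rule are all hypothetical and are not taken from the paper.

```python
def guidance_weight(confidence, w_min=0.2, w_max=1.0):
    """Map a planner confidence in [0, 1] to a guidance weight.

    Assumption: a linear mapping with hypothetical bounds w_min and w_max.
    """
    c = min(max(confidence, 0.0), 1.0)  # clamp out-of-range confidences
    return w_min + (w_max - w_min) * c


def blend_prediction(prior_pred, guided_pred, confidence):
    """Interpolate element-wise between unguided and guided predictions.

    Assumption: both predictions are equal-length sequences of floats.
    High confidence pulls the result toward the guided prediction;
    low confidence leaves more weight on the model's own prior.
    """
    w = guidance_weight(confidence)
    return [(1.0 - w) * p + w * g for p, g in zip(prior_pred, guided_pred)]
```

Under these assumptions, a fully confident plan reproduces the guided prediction, while a zero-confidence plan still receives a small weight of `w_min`, so the trajectory is never ignored outright.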