
MotiMotion: Motion-Controlled Video Generation with Visual Reasoning

May 21, 2026
Authors: Lee Hsin-Ying, Hanwen Jiang, Yiqun Mei, Jing Shi, Ming-Hsuan Yang, Zhixin Shu
cs.AI

Abstract

Current motion-controlled image-to-video generation models rigidly follow user-provided trajectories, which are often sparse, imprecise, and causally incomplete. This reliance frequently yields unnatural or implausible results, in particular because secondary causal consequences of the motion are missing. To address this, we introduce MotiMotion, a novel framework that reformulates motion control as a reasoning-then-generation problem. To encourage causally grounded and commonsense-consistent interactions, we leverage a training-free vision-language reasoner to refine the image-space coordinates of primary trajectories and to hallucinate plausible secondary motions. To further improve motion naturalness, we propose a confidence-aware control scheme that modulates guidance strength, enabling the model to closely follow high-confidence plans while relying on its internal generative priors to correct artifacts under low-confidence inputs. To support systematic evaluation, we curate a new image-to-video benchmark, MotiBench, consisting of interaction-centric scenes in which motion triggers new events. Both VLM-based evaluation and a human study on MotiBench show that MotiMotion produces videos with more plausible object behaviors and interactions, and that it is preferred over existing approaches.
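
The abstract does not spell out how guidance strength is modulated, so the following is only an illustrative sketch of the general idea of confidence-aware guidance, not the authors' formulation. It assumes a confidence value in [0, 1], a linear mapping from confidence to a guidance scale, and a classifier-free-guidance-style blend of two predictions. The function names `guidance_scale` and `apply_guidance` and the default scale range are hypothetical choices made for this example.

```python
from typing import List, Sequence


def guidance_scale(confidence: float,
                   min_scale: float = 1.0,
                   max_scale: float = 7.5) -> float:
    """Map a confidence in [0, 1] to a guidance scale (assumed linear mapping).

    The scale range is a hypothetical default, not a value from the paper.
    """
    c = min(max(confidence, 0.0), 1.0)  # clamp out-of-range confidences
    return min_scale + c * (max_scale - min_scale)


def apply_guidance(uncond: Sequence[float],
                   cond: Sequence[float],
                   confidence: float) -> List[float]:
    """Blend unconditioned and motion-conditioned predictions.

    Uses the familiar classifier-free-guidance form
    uncond + s * (cond - uncond), with s chosen from the confidence.
    Low confidence gives a weak push toward the motion condition,
    leaving more room for the model's own prior; high confidence
    gives a strong push toward the planned trajectory.
    """
    s = guidance_scale(confidence)
    return [u + s * (c - u) for u, c in zip(uncond, cond)]


if __name__ == "__main__":
    uncond = [0.0, 0.0]
    cond = [1.0, 2.0]
    for conf in (0.0, 0.5, 1.0):
        print(conf, guidance_scale(conf), apply_guidance(uncond, cond, conf))
```

In this sketch, a low-confidence trajectory yields a scale near `min_scale`, so the output stays close to the plain conditional prediction, while a high-confidence trajectory amplifies the difference between the conditioned and unconditioned predictions. A real system would apply this per denoising step to model outputs rather than to short lists of floats.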