MoRight: Motion Control Done Right
April 8, 2026
Authors: Shaowei Liu, Xuanchi Ren, Tianchang Shen, Huan Ling, Saurabh Gupta, Shenlong Wang, Sanja Fidler, Jun Gao
cs.AI
Abstract
Generating motion-controlled videos, in which user-specified actions drive physically plausible scene dynamics under freely chosen viewpoints, demands two capabilities: (1) disentangled motion control, allowing users to control object motion and adjust the camera viewpoint independently; and (2) motion causality, ensuring that user-driven actions trigger coherent reactions from other objects rather than merely displacing pixels. Existing methods fall short on both fronts: they entangle camera and object motion into a single tracking signal, and they treat motion as kinematic displacement without modeling causal relationships between objects. We introduce MoRight, a unified framework that addresses both limitations through disentangled motion modeling. Object motion is specified in a canonical static view and transferred to an arbitrary target camera viewpoint via temporal cross-view attention, enabling disentangled camera and object control. We further decompose motion into active (user-driven) and passive (consequent) components, training the model to learn motion causality from data. At inference, users can either supply active motion, from which MoRight predicts the resulting consequences (forward reasoning), or specify desired passive outcomes, from which MoRight recovers plausible driving actions (inverse reasoning), all while freely adjusting the camera viewpoint. Experiments on three benchmarks demonstrate state-of-the-art performance in generation quality, motion controllability, and interaction awareness.
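The temporal cross-view attention described in the abstract can be illustrated with a minimal, hypothetical sketch: for each frame, query tokens from the target camera view attend over motion tokens defined in the static canonical view of the same frame, so motion specified once in the canonical view is transferred to any viewpoint. All names, shapes, and the dependency-free list-based layout here are illustrative assumptions, not the paper's implementation.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_view_attention(target_feats, canon_motion, scale=None):
    """Per-frame cross-view attention (illustrative sketch).

    target_feats: [T][Nq][d] query tokens from the target camera view.
    canon_motion: [T][Nk][d] motion tokens from the canonical static view,
                  used as both keys and values.
    Returns [T][Nq][d]: each target token becomes a convex combination of
    the canonical-view motion tokens of the same frame.
    """
    d = len(target_feats[0][0])
    scale = scale if scale is not None else 1.0 / math.sqrt(d)
    out = []
    for q_frame, kv_frame in zip(target_feats, canon_motion):
        frame_out = []
        for q in q_frame:
            # Scaled dot-product scores against canonical-view tokens.
            scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                      for k in kv_frame]
            w = softmax(scores)
            # Weighted sum of canonical-view values.
            frame_out.append([sum(wj * kv_frame[j][i]
                                  for j, wj in enumerate(w))
                              for i in range(d)])
        out.append(frame_out)
    return out
```

With identical canonical-view tokens, the attention weights are uniform and the output simply reproduces that token, which makes the transfer behavior easy to check by hand.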