MotiMotion: Motion-Controlled Video Generation with Visual Reasoning

Abstract

Current motion-controlled image-to-video generation models rigidly follow user-provided trajectories, which are often sparse, imprecise, and causally incomplete. This reliance frequently yields unnatural or implausible results, most notably because the secondary causal consequences of a motion are missed. To address this, we introduce MotiMotion, a novel framework that reformulates motion control as a reasoning-then-generation problem. To encourage causally grounded and commonsense-consistent interactions, we leverage a training-free vision-language reasoner to refine the image-space coordinates of primary trajectories and to hallucinate plausible secondary motions. To further improve motion naturalness, we propose a confidence-aware control scheme that modulates guidance strength, so that the model closely follows high-confidence plans while relying on its internal generative priors to correct artifacts under low-confidence inputs. To support systematic evaluation, we curate MotiBench, a new image-to-video benchmark consisting of interaction-centric scenes in which motion triggers new events. Both VLM-based evaluation and a human study on MotiBench show that MotiMotion produces videos with more plausible object behaviors and interactions, and that it is preferred over existing approaches.
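
The abstract does not specify how confidence is mapped to guidance strength. As a purely illustrative sketch, one simple possibility is a clamped linear interpolation between a weak and a strong guidance value. Everything named here (`PlannedTrajectory`, `guidance_strength`, the `confidence` field, and the 0.25 and 1.0 bounds) is a hypothetical assumption for illustration, not the method used in MotiMotion.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PlannedTrajectory:
    """Hypothetical container for one trajectory produced by a reasoner."""
    points: List[Tuple[float, float]]  # image-space (x, y) coordinates, one per frame
    confidence: float                  # assumed reasoner confidence in [0, 1]


def guidance_strength(confidence: float,
                      min_strength: float = 0.25,
                      max_strength: float = 1.0) -> float:
    """Map a confidence score to a guidance strength (assumed linear rule).

    High confidence gives strength near max_strength, so the generator would
    follow the plan closely; low confidence gives strength near min_strength,
    leaving more room for the generator's own priors.
    """
    clamped = min(1.0, max(0.0, confidence))  # keep the score inside [0, 1]
    return min_strength + clamped * (max_strength - min_strength)


# Example: a confident primary trajectory and a less certain secondary one.
primary = PlannedTrajectory(points=[(10.0, 20.0), (12.0, 22.0)], confidence=1.0)
secondary = PlannedTrajectory(points=[(40.0, 50.0), (41.0, 53.0)], confidence=0.5)
strengths = [guidance_strength(t.confidence) for t in (primary, secondary)]
```

Under this assumed rule, the fully confident primary trajectory receives the maximum strength, while the half-confident secondary trajectory receives a value midway between the two bounds.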