Generative Video Motion Editing with 3D Point Tracks
December 1, 2025
Authors: Yao-Chih Lee, Zhoutong Zhang, Jiahui Huang, Jui-Hsien Wang, Joon-Young Lee, Jia-Bin Huang, Eli Shechtman, Zhengqi Li
cs.AI
Abstract
Camera and object motions are central to a video's narrative. However, precisely editing these captured motions remains a significant challenge, especially under complex object movements. Current motion-controlled image-to-video (I2V) approaches often lack full-scene context for consistent video editing, while video-to-video (V2V) methods provide viewpoint changes or basic object translation, but offer limited control over fine-grained object motion. We present a track-conditioned V2V framework that enables joint editing of camera and object motion. We achieve this by conditioning a video generation model on a source video and paired 3D point tracks representing source and target motions. These 3D tracks establish sparse correspondences that transfer rich context from the source video to new motions while preserving spatiotemporal coherence. Crucially, compared to 2D tracks, 3D tracks provide explicit depth cues, allowing the model to resolve depth order and handle occlusions for precise motion editing. Trained in two stages on synthetic and real data, our model supports diverse motion edits, including joint camera/object manipulation, motion transfer, and non-rigid deformation, unlocking new creative potential in video editing.
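To make the described conditioning signal concrete, the following is a minimal Python sketch of the interface the abstract implies: a generator conditioned on a source video plus paired source/target 3D point tracks. The class name `TrackConditionedV2V`, the tensor shapes, and the example edit are illustrative assumptions, not the authors' released code.

```python
import torch

# Assumed shapes, for illustration only:
#   source_video : (T, 3, H, W)  RGB frames of the source clip
#   src_tracks   : (T, N, 3)     3D point tracks (x, y, depth) of the source motion
#   tgt_tracks   : (T, N, 3)     the same N points after the desired motion edit
# The paired tracks give the model sparse source-to-target correspondences, and
# the third (depth) channel lets it reason about depth order and occlusion.

T, H, W, N = 16, 256, 256, 1024
source_video = torch.rand(T, 3, H, W)
src_tracks = torch.rand(T, N, 3)

# Example edit: push every tracked point 0.5 units farther from the camera,
# i.e. a simple camera/object manipulation expressed purely through the 3D tracks.
tgt_tracks = src_tracks.clone()
tgt_tracks[..., 2] += 0.5

class TrackConditionedV2V(torch.nn.Module):
    """Placeholder for a track-conditioned video-to-video generator."""
    def forward(self, video, src_tracks, tgt_tracks):
        # A real model would denoise latent video tokens while attending to
        # embeddings of the paired tracks; here we return the input unchanged
        # so the sketch stays self-contained and runnable.
        return video

model = TrackConditionedV2V()
edited_video = model(source_video, src_tracks, tgt_tracks)  # (T, 3, H, W)
```

Under this framing, the different edits listed in the abstract (joint camera/object manipulation, motion transfer, non-rigid deformation) differ only in how `tgt_tracks` is obtained from `src_tracks`, while the generator and source video stay fixed.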