Go-with-the-Track: Video Compositing and Motion Control with Point Tracking

June 18, 2026
Authors: Koichi Namekata, Yash Kant, Zhizheng Liu, Ryan D Burgert, Yuancheng Xu, Kuan Heng Lin, Emmett Steven, Julien Philip, Li Ma, Andrea Vedaldi, Paul Debevec, Ning Yu
cs.AI

Abstract

Filmmaking demands both precise motion control and reference image compositing, yet existing methods treat these capabilities separately. Point-track-conditioned image-to-video models restrict content insertion to the first frame, while reference-to-video models lack fine-grained spatiotemporal control over how reference content integrates across frames. We present Go-with-the-Track, which unifies both capabilities by jointly conditioning on multiple reference images and reference-anchored point-tracks. These extend conventional point-tracks to explicitly establish correspondences between generated frames and reference images, enabling precise compositing and motion control throughout the video. To achieve this, we introduce spatially aware point-track embeddings that encode the full sequence of point-track coordinates using a coordinate-wise MLP followed by temporal pooling. This representation captures the spatial characteristics of each point-track (serving as a unique identifier), while the embedding similarity correlates directly with spatial proximity, enhancing the model's ability to distinguish and associate point-tracks. We inject these point-track embeddings into a video diffusion transformer via a lightweight adapter, resolving the pixel-to-patch resolution mismatch while avoiding the substantial loss of motion detail inherent in naive point-track subsampling. A hybrid training strategy trains jointly on dynamic, static, and synthetic scene video datasets to boost motion controllability. Experiments demonstrate that Go-with-the-Track achieves superior motion and reference control in a single model and enables new capabilities: multi-reference-conditioned video generation with point-track-driven compositing, as well as camera control for both static and dynamic scenes. Project Page: https://eyeline-labs.github.io/Go-with-the-Track/
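
To make the "coordinate-wise MLP followed by temporal pooling" idea concrete, here is a minimal sketch in plain Python. It is one plausible reading of the abstract, not the authors' implementation. The function names (`init_mlp`, `embed_track`), the two-layer ReLU MLP, the layer sizes, and the choice of mean pooling over all frames are all assumptions made for illustration; the paper may use a different architecture or pool over temporal windows instead. The weights are random and untrained, so the sketch only shows the data flow from a sequence of (x, y) coordinates to a single embedding vector per track.

```python
import math
import random


def init_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Random weights for a two-layer MLP (untrained, illustration only)."""
    rng = random.Random(seed)

    def layer(n_in, n_out):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        return weights, biases

    return layer(in_dim, hidden_dim), layer(hidden_dim, out_dim)


def linear(weights, biases, x):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def mlp(params, xy):
    """Apply the shared MLP to a single (x, y) coordinate."""
    (w1, b1), (w2, b2) = params
    hidden = [max(0.0, h) for h in linear(w1, b1, xy)]  # ReLU
    return linear(w2, b2, hidden)


def embed_track(params, track):
    """Embed one point-track.

    track: list of (x, y) pairs, one per frame, assumed normalized to [0, 1].
    The same MLP is applied to every frame's coordinate, then the per-frame
    features are mean-pooled over time (an assumed pooling choice).
    """
    if not track:
        raise ValueError("track must contain at least one frame")
    per_frame = [mlp(params, list(xy)) for xy in track]
    n_frames = len(per_frame)
    return [sum(column) / n_frames for column in zip(*per_frame)]


# Hypothetical usage: a point drifting to the right over five frames.
params = init_mlp(in_dim=2, hidden_dim=16, out_dim=8)
track = [(0.10 + 0.01 * t, 0.20) for t in range(5)]
embedding = embed_track(params, track)
```

Because the MLP is a continuous function of the coordinates, tracks that stay close together in image space map to nearby embeddings, which is consistent with the abstract's remark that embedding similarity correlates with spatial proximity. How strongly that holds in practice depends on training, which this sketch does not cover.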