Point Prompting: Counterfactual Tracking with Video Diffusion Models
October 13, 2025
Authors: Ayush Shrivastava, Sanyam Mehta, Daniel Geng, Andrew Owens
cs.AI
Abstract
Trackers and video generators solve closely related problems: the former
analyze motion, while the latter synthesize it. We show that this connection
enables pretrained video diffusion models to perform zero-shot point tracking
by simply prompting them to visually mark points as they move over time. We
place a distinctively colored marker at the query point, then regenerate the
rest of the video from an intermediate noise level. This propagates the marker
across frames, tracing the point's trajectory. To ensure that the marker
remains visible in this counterfactual generation, despite such markers being
unlikely in natural videos, we use the unedited initial frame as a negative
prompt. Through experiments with multiple image-conditioned video diffusion
models, we find that these "emergent" tracks outperform those of prior
zero-shot methods and persist through occlusions, often obtaining performance
that is competitive with specialized self-supervised models.
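
The steps described in the abstract (mark the query point, re-noise the video to an intermediate level, guide generation away from the unedited frame, then read the marker position back out of each frame) can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names, the red marker color, the disc radius, the linear noise blend, and the classifier-free-guidance-style combination are all assumptions made for illustration, and the video diffusion model itself is left out: in practice its two denoising predictions would be passed to guided_prediction.

```python
import numpy as np

# Assumed marker color (pure red) for frames stored as floats in [0, 1].
MARKER_RGB = np.array([1.0, 0.0, 0.0], dtype=np.float32)


def place_marker(frame, xy, radius=3, color=MARKER_RGB):
    """Return a copy of frame (H, W, 3) with a filled disc drawn at xy = (x, y)."""
    h, w, _ = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    mask = (xs - xy[0]) ** 2 + (ys - xy[1]) ** 2 <= radius ** 2
    out = frame.copy()
    out[mask] = color
    return out


def noise_to_level(video, t, rng):
    """Blend a video with Gaussian noise (assumed linear schedule).

    t = 0 keeps the video unchanged and t = 1 yields pure noise; a real model
    would use its own noise schedule instead of this linear blend.
    """
    noise = rng.standard_normal(video.shape).astype(np.float32)
    return (1.0 - t) * video + t * noise


def guided_prediction(pred_edited, pred_unedited, scale):
    """Extrapolate from the negative branch toward the marked branch.

    pred_edited is the (hypothetical) model output conditioned on the marked
    first frame; pred_unedited is the output conditioned on the unedited
    frame, used as the negative prompt. scale > 1 strengthens the marker.
    """
    return pred_unedited + scale * (pred_edited - pred_unedited)


def locate_marker(frame, color=MARKER_RGB, tol=0.3):
    """Return the centroid (x, y) of marker-colored pixels, or None if absent."""
    dist = np.linalg.norm(frame - color, axis=-1)
    ys, xs = np.nonzero(dist < tol)
    if xs.size == 0:
        return None
    return float(xs.mean()), float(ys.mean())


def read_track(video, color=MARKER_RGB, tol=0.3):
    """Apply locate_marker to every frame of a (T, H, W, 3) video."""
    return [locate_marker(frame, color, tol) for frame in video]
```

Given a regenerated video of shape (T, H, W, 3), read_track returns one (x, y) estimate per frame, with None wherever no marker-colored pixels are found. Thresholding on color distance is only one possible readout, chosen here for simplicity.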