Point Prompting: Counterfactual Tracking with Video Diffusion Models
October 13, 2025
Authors: Ayush Shrivastava, Sanyam Mehta, Daniel Geng, Andrew Owens
cs.AI
Abstract
Trackers and video generators solve closely related problems: the former
analyze motion, while the latter synthesize it. We show that this connection
enables pretrained video diffusion models to perform zero-shot point tracking
by simply prompting them to visually mark points as they move over time. We
place a distinctively colored marker at the query point, then regenerate the
rest of the video from an intermediate noise level. This propagates the marker
across frames, tracing the point's trajectory. To ensure that the marker
remains visible in this counterfactual generation, despite such markers being
unlikely in natural videos, we use the unedited initial frame as a negative
prompt. Through experiments with multiple image-conditioned video diffusion
models, we find that these "emergent" tracks outperform those of prior
zero-shot methods and persist through occlusions, often obtaining performance
that is competitive with specialized self-supervised models.
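
To make the described procedure concrete, here is a minimal sketch of the point-prompting loop, not the authors' code. The `diffusion_model.regenerate` interface, the marker color, the marker radius, and the noise level are placeholder assumptions standing in for whatever image-conditioned video diffusion model and settings one would plug in; only the marker painting and the marker localization steps are spelled out concretely.

```python
import numpy as np


def place_marker(frame: np.ndarray, point_xy: tuple[int, int],
                 color=(255, 0, 0), radius: int = 4) -> np.ndarray:
    """Paint a distinctively colored disk at the query point in the first frame."""
    marked = frame.copy()
    h, w = frame.shape[:2]
    yy, xx = np.ogrid[:h, :w]
    mask = (xx - point_xy[0]) ** 2 + (yy - point_xy[1]) ** 2 <= radius ** 2
    marked[mask] = color
    return marked


def locate_marker(frame: np.ndarray, color=(255, 0, 0)) -> tuple[int, int]:
    """Read off the track by finding the pixel closest to the marker color."""
    dist = np.linalg.norm(frame.astype(np.float32) - np.array(color, np.float32), axis=-1)
    y, x = np.unravel_index(np.argmin(dist), dist.shape)
    return int(x), int(y)


def track_point(video: np.ndarray, point_xy: tuple[int, int],
                diffusion_model, noise_level: float = 0.5) -> list[tuple[int, int]]:
    """Counterfactual tracking sketch:
    1. Mark the query point in frame 0.
    2. Re-noise the remaining frames to an intermediate level and regenerate them,
       conditioned on the marked first frame, so the marker propagates with the point.
    3. Pass the *unedited* first frame as a negative prompt so the model is steered
       away from erasing the unnatural marker.
    4. Localize the marker color in each generated frame.
    """
    marked_first = place_marker(video[0], point_xy)
    # Hypothetical interface: any image-conditioned video diffusion model that
    # supports partial regeneration from an intermediate noise level and
    # negative conditioning could be wrapped to expose this call.
    generated = diffusion_model.regenerate(
        frames=video[1:],
        condition_frame=marked_first,
        negative_condition=video[0],    # unedited initial frame as negative prompt
        start_noise_level=noise_level,  # regenerate from an intermediate noise level
    )
    track = [point_xy]
    for frame in generated:
        track.append(locate_marker(frame))
    return track
```

The negative-prompt step reflects the abstract's key trick: guiding generation away from the unedited initial frame discourages the model from treating the marker as an artifact to be removed, keeping it visible throughout the counterfactual video.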