
Taming generative video models for zero-shot optical flow extraction

July 11, 2025
Authors: Seungwoo Kim, Khai Loong Aw, Klemen Kotar, Cristobal Eyzaguirre, Wanhee Lee, Yunong Liu, Jared Watrous, Stefan Stojanov, Juan Carlos Niebles, Jiajun Wu, Daniel L. K. Yamins
cs.AI

Abstract

Extracting optical flow from videos remains a core computer vision problem. Motivated by the success of large general-purpose models, we ask whether frozen self-supervised video models trained only for future frame prediction can be prompted, without fine-tuning, to output flow. Prior work reading out depth or illumination from video generators required fine-tuning, which is impractical for flow, where labels are scarce and synthetic datasets suffer from a sim-to-real gap. Inspired by the Counterfactual World Model (CWM) paradigm, which obtains point-wise correspondences by injecting a small tracer perturbation into a next-frame predictor and tracking its propagation, we extend this idea to generative video models. We explore several popular architectures and find that successful zero-shot flow extraction in this manner is aided by three model properties: (1) distributional prediction of future frames (avoiding blurry or noisy outputs); (2) factorized latents that treat each spatio-temporal patch independently; and (3) random-access decoding that can condition on any subset of future pixels. These properties are uniquely present in the recent Local Random Access Sequence (LRAS) architecture. Building on LRAS, we propose KL-tracing: a novel test-time procedure that injects a localized perturbation into the first frame, rolls out the model one step, and computes the Kullback-Leibler divergence between the perturbed and unperturbed predictive distributions. Without any flow-specific fine-tuning, our method outperforms state-of-the-art models on the real-world TAP-Vid DAVIS dataset (16.6% relative improvement in endpoint error) and the synthetic TAP-Vid Kubric dataset (4.7% relative improvement). Our results indicate that counterfactual prompting of controllable generative video models is a scalable and effective alternative to supervised or photometric-loss approaches for high-quality flow.
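
To make the KL-tracing procedure concrete, the following is a minimal sketch under stated assumptions: a frozen next-frame predictor that exposes per-patch categorical logits, a single query point, and a toy translational "world model" so the example is self-checking. The function `predict_next_frame_logits`, the tracer amplitude, and the grid/codebook sizes are hypothetical stand-ins for illustration, not the paper's actual LRAS interface.

```python
import numpy as np

# --- Hypothetical stand-ins (not the paper's actual LRAS interface) ----------
H, W, K = 32, 32, 16   # spatial grid of patches, size of the per-patch codebook

def predict_next_frame_logits(frame: np.ndarray) -> np.ndarray:
    """Mock frozen next-frame predictor.

    Returns per-patch logits over a discrete codebook, shape (H, W, K).
    A real implementation would query the generative video model's
    predictive distribution for frame t+1 conditioned on `frame`.
    Toy dynamics: everything translates by (dy, dx) = (2, 3), so a tracer
    placed at (y, x) should surface near (y + 2, x + 3) in the prediction.
    """
    shifted = np.roll(frame, shift=(2, 3), axis=(0, 1))
    return shifted[..., None] * np.linspace(-1.0, 1.0, K)

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    """Elementwise KL(p || q) over the last (codebook) axis."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def kl_tracing(frame0: np.ndarray, query_yx: tuple, tracer_amp: float = 5.0):
    """Zero-shot flow for one query point via counterfactual KL-tracing.

    1. Roll out the frozen predictor on the clean first frame.
    2. Inject a small localized tracer at the query point and roll out again.
    3. Compare perturbed vs. unperturbed predictive distributions with the
       KL divergence; the peak of the KL map is the point's estimated
       location in the next frame.
    """
    clean_logits = predict_next_frame_logits(frame0)

    y, x = query_yx
    perturbed = frame0.copy()
    perturbed[y, x] += tracer_amp                 # localized tracer perturbation
    perturbed_logits = predict_next_frame_logits(perturbed)

    kl_map = kl_divergence(softmax(perturbed_logits), softmax(clean_logits))
    end_y, end_x = np.unravel_index(np.argmax(kl_map), kl_map.shape)
    return (int(end_y) - y, int(end_x) - x), kl_map   # (dy, dx) flow and KL map

rng = np.random.default_rng(0)
frame0 = rng.standard_normal((H, W))
flow, _ = kl_tracing(frame0, query_yx=(10, 10))
print("estimated flow (dy, dx):", flow)           # recovers the toy shift (2, 3)
```

In the real setting, the mock predictor would be replaced by the frozen generative video model's per-patch predictive distributions, and the hard argmax over the KL map is only one possible readout; the point of the sketch is the counterfactual comparison of perturbed and unperturbed distributions, not the toy dynamics.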