
Taming generative video models for zero-shot optical flow extraction

July 11, 2025
Authors: Seungwoo Kim, Khai Loong Aw, Klemen Kotar, Cristobal Eyzaguirre, Wanhee Lee, Yunong Liu, Jared Watrous, Stefan Stojanov, Juan Carlos Niebles, Jiajun Wu, Daniel L. K. Yamins
cs.AI

Abstract

Extracting optical flow from videos remains a core computer vision problem. Motivated by the success of large general-purpose models, we ask whether frozen self-supervised video models trained only for future frame prediction can be prompted, without fine-tuning, to output flow. Prior work reading out depth or illumination from video generators required fine-tuning, which is impractical for flow, where labels are scarce and synthetic datasets suffer from a sim-to-real gap. Inspired by the Counterfactual World Model (CWM) paradigm, which obtains point-wise correspondences by injecting a small tracer perturbation into a next-frame predictor and tracking its propagation, we extend this idea to generative video models. We explore several popular architectures and find that successful zero-shot flow extraction in this manner is aided by three model properties: (1) distributional prediction of future frames (avoiding blurry or noisy outputs); (2) factorized latents that treat each spatio-temporal patch independently; and (3) random-access decoding that can condition on any subset of future pixels. These properties are uniquely present in the recent Local Random Access Sequence (LRAS) architecture. Building on LRAS, we propose KL-tracing: a novel test-time procedure that injects a localized perturbation into the first frame, rolls out the model one step, and computes the Kullback-Leibler divergence between perturbed and unperturbed predictive distributions. Without any flow-specific fine-tuning, our method outperforms state-of-the-art models on the real-world TAP-Vid DAVIS dataset (16.6% relative improvement in endpoint error) and the synthetic TAP-Vid Kubric dataset (4.7% relative improvement). Our results indicate that counterfactual prompting of controllable generative video models is a scalable and effective alternative to supervised or photometric-loss approaches for high-quality flow.
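To make the KL-tracing procedure concrete, below is a minimal PyTorch sketch of the test-time loop described in the abstract. The `model` callable, its per-patch categorical-logit output shape, the patch size, and the additive tracer perturbation are illustrative assumptions standing in for the paper's actual LRAS predictor and perturbation scheme, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def kl_tracing_flow(model, frame0, query_xy, patch=8, eps=0.1):
    """Sketch of KL-tracing for a single query point.

    Assumes a hypothetical `model(frame)` that returns per-patch categorical
    logits over next-frame tokens with shape [H/patch, W/patch, vocab];
    `frame0` is a float tensor of shape [C, H, W].
    """
    x, y = query_xy

    # 1) Inject a small localized tracer perturbation at the query point.
    perturbed = frame0.clone()
    perturbed[:, y:y + patch, x:x + patch] += eps

    # 2) Single-step rollout: predictive distributions for the clean and
    #    perturbed first frame.
    with torch.no_grad():
        logp_clean = F.log_softmax(model(frame0), dim=-1)     # [h, w, vocab]
        logp_pert = F.log_softmax(model(perturbed), dim=-1)   # [h, w, vocab]

    # 3) Per-patch KL divergence between perturbed and unperturbed predictions.
    kl_map = (logp_pert.exp() * (logp_pert - logp_clean)).sum(dim=-1)  # [h, w]

    # 4) The patch where the divergence peaks is where the tracer "lands",
    #    giving the flow endpoint for the query point.
    idx = kl_map.flatten().argmax().item()
    h, w = kl_map.shape
    end_row, end_col = divmod(idx, w)
    end_x = end_col * patch + patch // 2
    end_y = end_row * patch + patch // 2
    return end_x - x, end_y - y  # flow vector (dx, dy)
```

The argmax over the KL map plays the role of tracking the tracer's propagation in the CWM paradigm; in practice one would repeat this for every query point and may use a soft (expectation-based) readout instead of a hard argmax.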