Taming generative video models for zero-shot optical flow extraction

Abstract

Extracting optical flow from videos remains a core computer vision problem. Motivated by the success of large general-purpose models, we ask whether frozen self-supervised video models trained only for future frame prediction can be prompted, without fine-tuning, to output flow. Prior work that read out depth or illumination from video generators required fine-tuning, which is impractical for flow, where labels are scarce and synthetic datasets suffer from a sim-to-real gap. The Counterfactual World Model (CWM) paradigm obtains point-wise correspondences by injecting a small tracer perturbation into a next-frame predictor and tracking its propagation; we extend this idea to generative video models. We explore several popular architectures and find that successful zero-shot flow extraction in this manner is aided by three model properties: (1) distributional prediction of future frames (avoiding blurry or noisy outputs); (2) factorized latents that treat each spatio-temporal patch independently; and (3) random-access decoding that can condition on any subset of future pixels. These properties are uniquely present in the recent Local Random Access Sequence (LRAS) architecture. Building on LRAS, we propose KL-tracing: a novel test-time procedure that injects a localized perturbation into the first frame, rolls out the model one step, and computes the Kullback-Leibler (KL) divergence between the perturbed and unperturbed predictive distributions. Without any flow-specific fine-tuning, our method outperforms state-of-the-art models on the real-world TAP-Vid DAVIS dataset (16.6% relative improvement in endpoint error) and on the synthetic TAP-Vid Kubric dataset (4.7% relative improvement). Our results indicate that counterfactual prompting of controllable generative video models is a scalable and effective alternative to supervised or photometric-loss approaches for high-quality flow.
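The KL-tracing procedure described above (perturb one location in the first frame, predict one step, compare predictive distributions) can be illustrated with a small toy sketch. This is not the LRAS model or the authors' implementation: `toy_predictor` is a hypothetical stand-in for a frozen next-frame model that outputs a categorical distribution per location, the "world" is a plain translation with wrap-around, and the names `kl_trace`, `sharpness`, and the KL direction (perturbed relative to unperturbed) are assumptions made for illustration.

```python
import numpy as np


def toy_predictor(frame, shift, num_tokens, sharpness=4.0):
    """Hypothetical stand-in for a frozen next-frame predictor.

    Returns an (H, W, num_tokens) array holding one categorical
    distribution per next-frame location. The toy world translates
    the token grid by `shift` = (dy, dx) with wrap-around.
    """
    moved = np.roll(frame, shift, axis=(0, 1))
    logits = sharpness * np.eye(num_tokens)[moved]  # peak at the moved token
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)


def kl_trace(frame, query, predictor, num_tokens, eps=1e-12):
    """Estimate the flow at `query` = (y, x) by tracing a perturbation."""
    y, x = query
    perturbed = frame.copy()
    # Tracer perturbation (assumed form): change a single token value.
    perturbed[y, x] = (frame[y, x] + 1) % num_tokens

    p_clean = predictor(frame)      # unperturbed predictive distribution
    p_pert = predictor(perturbed)   # perturbed predictive distribution

    # Per-location KL(p_pert || p_clean), summed over the token axis.
    kl = np.sum(p_pert * (np.log(p_pert + eps) - np.log(p_clean + eps)), axis=-1)

    # The location where the prediction changed most is taken as the
    # destination of the query point.
    peak = np.unravel_index(np.argmax(kl), kl.shape)
    flow = (int(peak[0]) - y, int(peak[1]) - x)
    return flow, kl


rng = np.random.default_rng(0)
K = 16  # toy vocabulary size (assumption)
frame = rng.integers(0, K, size=(8, 8))
predictor = lambda f: toy_predictor(f, shift=(2, 3), num_tokens=K)

flow, kl_map = kl_trace(frame, query=(1, 2), predictor=predictor, num_tokens=K)
print(flow)  # (2, 3) in this toy setup
```

In this toy setting the KL map is zero everywhere except at the location the perturbed token moves to, so the argmax recovers the translation exactly. With a real generative model the map would presumably be diffuse, and how the peak is localized is not specified in the abstract.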
