Self-Supervised Learning of Motion Concepts by Optimizing Counterfactuals

Abstract

Estimating motion in videos is an essential computer vision problem with many downstream applications, including controllable video generation and robotics. Current solutions are primarily trained using synthetic data or require tuning of situation-specific heuristics, which inherently limits these models' capabilities in real-world contexts. Despite recent developments in large-scale self-supervised learning from videos, leveraging such representations for motion estimation remains relatively underexplored. In this work, we develop Opt-CWM, a self-supervised technique for flow and occlusion estimation from a pre-trained next-frame prediction model. Opt-CWM works by learning to optimize counterfactual probes that extract motion information from a base video model, avoiding the need for fixed heuristics while training on unrestricted video inputs. We achieve state-of-the-art performance for motion estimation on real-world videos while requiring no labeled data.
