SIGMA: Sinkhorn-Guided Masked Video Modeling
July 22, 2024
Authors: Mohammadreza Salehi, Michael Dorkenwald, Fida Mohammad Thoker, Efstratios Gavves, Cees G. M. Snoek, Yuki M. Asano
cs.AI
Abstract
Video-based pretraining offers immense potential for learning strong visual
representations on an unprecedented scale. Recently, masked video modeling
methods have shown promising scalability, yet fall short in capturing
higher-level semantics due to reconstructing predefined low-level targets such
as pixels. To tackle this, we present Sinkhorn-guided Masked Video Modeling
(SIGMA), a novel video pretraining method that jointly learns the video model
and a target feature space using a projection network. However, this
simple modification means that the regular L2 reconstruction loss will lead to
trivial solutions as both networks are jointly optimized. As a solution, we
distribute features of space-time tubes evenly across a limited number of
learnable clusters. By posing this as an optimal transport problem, we enforce
high entropy in the generated features across the batch, infusing semantic and
temporal meaning into the feature space. The resulting cluster assignments are
used as targets for a symmetric prediction task where the video model predicts
the cluster assignments of the projection network and vice versa. Experimental
results on ten datasets across three benchmarks validate the effectiveness of
SIGMA in learning more performant, temporally aware, and robust video
representations, improving upon state-of-the-art methods. Our project website
with code is available at: https://quva-lab.github.io/SIGMA.
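
The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the entropy weight `epsilon`, the iteration count, and the softmax `temperature` are all assumptions chosen for illustration, and the inputs are assumed to be bounded similarity scores (such as cosine similarities) between tube features and cluster prototypes. The sketch uses Sinkhorn-Knopp iterations to spread a batch of features evenly across clusters, then uses each branch's assignments as targets for the other branch.

```python
import math


def sinkhorn_assignments(scores, epsilon=0.05, n_iters=50):
    """Balanced soft cluster assignments via Sinkhorn-Knopp iterations.

    scores: list of N rows, each holding K feature-to-prototype similarities.
    Returns an N x K matrix whose rows sum to 1 and whose columns each sum
    to approximately N / K, i.e. clusters are used equally across the batch.
    epsilon and n_iters are illustrative defaults, not values from the paper.
    """
    n, k = len(scores), len(scores[0])
    # Subtract the global maximum before exponentiating for numerical stability.
    top = max(max(row) for row in scores)
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: give every cluster the same total mass, 1 / K.
        col = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / (col[j] * k) for j in range(k)] for i in range(n)]
        # Row step: give every feature the same total mass, 1 / N.
        row = [sum(r) for r in q]
        q = [[q[i][j] / (row[i] * n) for j in range(k)] for i in range(n)]
    # Rescale so that each row is a probability distribution over clusters.
    return [[v * n for v in r] for r in q]


def cross_entropy(targets, logits, temperature=0.1):
    """Mean cross-entropy between target rows and softmax(logits / temperature)."""
    total = 0.0
    for t_row, l_row in zip(targets, logits):
        scaled = [l / temperature for l in l_row]
        m = max(scaled)
        log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
        total -= sum(t * (s - log_z) for t, s in zip(t_row, scaled))
    return total / len(targets)


def symmetric_loss(model_scores, projector_scores):
    """Each branch predicts the Sinkhorn assignments of the other branch.

    In gradient-based training the targets would be treated as constants;
    this sketch only computes the loss value.
    """
    q_model = sinkhorn_assignments(model_scores)
    q_proj = sinkhorn_assignments(projector_scores)
    return (cross_entropy(q_proj, model_scores)
            + cross_entropy(q_model, projector_scores))


# Usage: four tube features scored against two prototypes, for each branch.
model_scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
projector_scores = [[0.2, 0.8], [0.6, 0.4], [0.9, 0.1], [0.3, 0.7]]
loss = symmetric_loss(model_scores, projector_scores)
```

Without the balancing step, both networks could map every feature to a single cluster and trivially agree; the equal-mass constraint on the columns is what rules out that collapse in this sketch.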