SIGMA: Sinkhorn-Guided Masked Video Modeling
July 22, 2024
Authors: Mohammadreza Salehi, Michael Dorkenwald, Fida Mohammad Thoker, Efstratios Gavves, Cees G. M. Snoek, Yuki M. Asano
cs.AI
Abstract
Video-based pretraining offers immense potential for learning strong visual
representations on an unprecedented scale. Recently, masked video modeling
methods have shown promising scalability, yet fall short in capturing
higher-level semantics due to reconstructing predefined low-level targets such
as pixels. To tackle this, we present Sinkhorn-guided Masked Video Modeling
(SIGMA), a novel video pretraining method that jointly learns the video model
and a target feature space using a projection network. However, this
simple modification means that the regular L2 reconstruction loss will lead to
trivial solutions as both networks are jointly optimized. As a solution, we
distribute features of space-time tubes evenly across a limited number of
learnable clusters. By posing this as an optimal transport problem, we enforce
high entropy in the generated features across the batch, infusing semantic and
temporal meaning into the feature space. The resulting cluster assignments are
used as targets for a symmetric prediction task where the video model predicts
the cluster assignments of the projection network and vice versa. Experimental
results on ten datasets across three benchmarks validate the effectiveness of
SIGMA in learning more performant, temporally-aware, and robust video
representations, improving upon state-of-the-art methods. Our project website
with code is available at: https://quva-lab.github.io/SIGMA.
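The balanced clustering and symmetric prediction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a SwAV-style Sinkhorn-Knopp normalization, and the function names (`sinkhorn`, `symmetric_loss`), the `epsilon` and `temperature` values, and the iteration count are all hypothetical choices made for illustration. The inputs are assumed to be similarity scores between tube features and cluster prototypes, one row per space-time tube.

```python
import math

def sinkhorn(scores, epsilon=0.05, n_iters=3):
    """Map an (n_tubes x n_clusters) score matrix to soft assignments whose
    cluster usage is approximately uniform across the batch.
    epsilon and n_iters are illustrative defaults, not values from the paper."""
    n, k = len(scores), len(scores[0])
    # Subtract the global maximum before exponentiating for numerical stability.
    m = max(max(row) for row in scores)
    q = [[math.exp((s - m) / epsilon) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: each cluster receives total mass 1/k.
        col = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / (col[j] * k) for j in range(k)] for i in range(n)]
        # Row step: each tube carries total mass 1/n.
        row = [sum(r) for r in q]
        q = [[v / (row[i] * n) for v in q[i]] for i in range(n)]
    # Rescale so that every row is a probability distribution over clusters.
    return [[v * n for v in r] for r in q]

def log_softmax(row, temperature=0.1):
    """Numerically stable log-softmax of one row of scores."""
    z = [s / temperature for s in row]
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def cross_entropy(targets, scores):
    """Mean cross-entropy between soft targets and softmax(scores)."""
    total = 0.0
    for t_row, s_row in zip(targets, scores):
        total -= sum(t * lp for t, lp in zip(t_row, log_softmax(s_row)))
    return total / len(scores)

def symmetric_loss(model_scores, proj_scores):
    """Each branch predicts the other branch's balanced assignments.
    In real training the Sinkhorn targets would be treated as constants."""
    return 0.5 * (cross_entropy(sinkhorn(proj_scores), model_scores)
                  + cross_entropy(sinkhorn(model_scores), proj_scores))
```

Because every cluster is forced to receive roughly the same share of the batch, mapping all features to a single cluster is not a valid target, which is the collapse that the plain L2 loss permits when both networks are trained jointly.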