Predicting masked tokens in stochastic locations improves masked image modeling
July 31, 2023
Authors: Amir Bar, Florian Bordes, Assaf Shocher, Mahmoud Assran, Pascal Vincent, Nicolas Ballas, Trevor Darrell, Amir Globerson, Yann LeCun
cs.AI
Abstract
Self-supervised learning is a promising paradigm in deep learning that
enables learning from unlabeled data by constructing pretext tasks that require
learning useful representations. In natural language processing, the dominant
pretext task has been masked language modeling (MLM), while in computer vision
there exists an equivalent called Masked Image Modeling (MIM). However, MIM is
challenging because it requires predicting semantic content in accurate
locations. E.g, given an incomplete picture of a dog, we can guess that there
is a tail, but we cannot determine its exact location. In this work, we propose
FlexPredict, a stochastic model that addresses this challenge by incorporating
location uncertainty into the model. Specifically, we condition the model on
stochastic masked token positions to guide the model toward learning features
that are more robust to location uncertainties. Our approach improves
downstream performance on a range of tasks, e.g., compared to MIM baselines,
FlexPredict boosts ImageNet linear probing by 1.6% with ViT-B and by 2.5% for
semi-supervised video segmentation using ViT-L.
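
The core idea of conditioning the predictor on stochastic masked-token positions can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the Gaussian noise model, and the `sigma` hyperparameter are all illustrative assumptions; the point is only that position embeddings of masked tokens are perturbed before prediction, so the model cannot rely on exact target locations.

```python
import numpy as np

def stochastic_masked_positions(pos_embed, mask, sigma=0.25, rng=None):
    """Perturb position embeddings of masked tokens with Gaussian noise.

    pos_embed: (num_tokens, dim) array of position embeddings.
    mask:      (num_tokens,) boolean array, True where a token is masked.
    sigma:     noise scale (hypothetical hyperparameter).
    rng:       optional numpy Generator for reproducibility.

    Visible tokens keep their exact positions; masked tokens receive a
    noisy position, so the predictor must learn features that are robust
    to where the masked content actually appears.
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = pos_embed.copy()
    noise = rng.normal(0.0, sigma, size=(int(mask.sum()), pos_embed.shape[1]))
    noisy[mask] += noise
    return noisy
```

At training time such a function would be applied to the mask tokens fed into the predictor at every step, so each iteration sees a different perturbation of the target locations.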