Predicting masked tokens in stochastic locations improves masked image modeling

Abstract

Self-supervised learning is a promising paradigm in deep learning that enables learning from unlabeled data by constructing pretext tasks that require learning useful representations. In natural language processing, the dominant pretext task has been masked language modeling (MLM), while in computer vision the equivalent is masked image modeling (MIM). However, MIM is challenging because it requires predicting semantic content at exact locations. For example, given an incomplete picture of a dog, we can guess that there is a tail, but we cannot determine its exact location. In this work, we propose FlexPredict, a stochastic model that addresses this challenge by incorporating location uncertainty into the model. Specifically, we condition the model on stochastic masked token positions to guide it toward learning features that are more robust to location uncertainties. Our approach improves downstream performance on a range of tasks. For example, compared to MIM baselines, FlexPredict boosts ImageNet linear probing by 1.6% with ViT-B and semi-supervised video segmentation by 2.5% with ViT-L.
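To make the idea of conditioning on stochastic masked token positions concrete, the following is a minimal sketch and not the FlexPredict implementation. The abstract does not say how the stochastic positions are produced, so this sketch assumes zero-mean Gaussian noise added to a sinusoidal positional embedding of each masked patch. The function names, the `sigma` parameter, and the choice of sinusoidal embeddings are all hypothetical.

```python
import math
import random


def sinusoidal_embedding(position, dim):
    """Deterministic sinusoidal embedding of one integer patch index.

    `dim` is assumed to be even.
    """
    emb = []
    for k in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * k / dim))
        emb.append(math.sin(position * freq))
        emb.append(math.cos(position * freq))
    return emb


def stochastic_mask_queries(masked_positions, dim, sigma, rng):
    """Build one query vector per masked patch.

    Assumption (not stated in the abstract): location uncertainty is modeled
    by adding zero-mean Gaussian noise with standard deviation `sigma` to the
    positional embedding of each masked patch.
    """
    queries = []
    for pos in masked_positions:
        clean = sinusoidal_embedding(pos, dim)
        noisy = [v + rng.gauss(0.0, sigma) for v in clean]
        queries.append(noisy)
    return queries


rng = random.Random(0)          # fixed seed so the sketch is reproducible
masked = [3, 7, 12]             # hypothetical indices of masked patches
queries = stochastic_mask_queries(masked, dim=8, sigma=0.25, rng=rng)
deterministic = stochastic_mask_queries(masked, dim=8, sigma=0.0, rng=rng)
```

With `sigma=0.0` the queries reduce to the ordinary deterministic positional embeddings, which corresponds to the standard MIM setting in which the predictor is told the exact location of every masked token.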
