SILC: Improving Vision Language Pretraining with Self-Distillation

Abstract

Image-text pretraining on web-scale image caption datasets has become the default recipe for open-vocabulary classification and retrieval models, thanks to the success of CLIP and its variants. Several works have also used CLIP features for dense prediction tasks and have shown the emergence of open-set abilities. However, the contrastive objective focuses only on image-text alignment and does not incentivise image feature learning for dense prediction tasks. In this work, we propose SILC, which adds local-to-global correspondence learning by self-distillation as an additional objective for contrastive pre-training. We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on several computer vision tasks, including classification, retrieval, and especially segmentation. We further show that SILC scales better than the baselines for the same training duration. Our model SILC sets a new state of the art for zero-shot classification, few-shot classification, image and text retrieval, zero-shot segmentation, and open-vocabulary segmentation.
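The abstract names three ingredients: a contrastive image-text objective, a self-distillation objective that matches local image views to global ones, and an EMA teacher. The following is a minimal NumPy sketch of how these pieces could fit together, not the authors' implementation. The abstract does not specify the loss forms, so the symmetric softmax contrastive loss, the cross-entropy distillation loss, the temperatures, the momentum value, and all function names are assumptions made for illustration.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric softmax contrastive loss over matched image/text pairs.

    Assumption: a CLIP-style loss; the abstract only says "contrastive".
    """
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = img @ txt.T / temperature
    idx = np.arange(logits.shape[0])
    image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (image_to_text + text_to_image)


def self_distillation_loss(student_local_logits, teacher_global_logits,
                           t_student=0.1, t_teacher=0.04):
    """Cross-entropy between teacher (global view) and student (local view).

    Assumption: a sharpened teacher distribution is the target, and the
    teacher receives no gradient. Temperatures are illustrative values.
    """
    teacher_probs = np.exp(log_softmax(teacher_global_logits / t_teacher))
    student_logp = log_softmax(student_local_logits / t_student)
    return -(teacher_probs * student_logp).sum(axis=-1).mean()


def ema_update(teacher_params, student_params, momentum=0.99):
    """Move each teacher parameter towards the matching student parameter."""
    return {
        name: momentum * teacher_params[name]
        + (1.0 - momentum) * student_params[name]
        for name in teacher_params
    }
```

In a training loop under these assumptions, the total loss would be a weighted sum of `contrastive_loss` and `self_distillation_loss`, with `ema_update` called after each optimizer step so that the teacher tracks a smoothed copy of the student.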
