Improved baselines for vision-language pre-training

Abstract

Contrastive learning has emerged as an efficient framework for learning multimodal representations. CLIP, a seminal work in this area, achieved impressive results by training on paired image-text data with a contrastive loss. Recent work claims improvements over CLIP by adding non-contrastive losses inspired by self-supervised learning. However, it is sometimes hard to disentangle the contribution of these additional losses from other implementation details used to train the model, such as data augmentation or regularization techniques. To shed light on this matter, we first propose, implement, and evaluate several baselines obtained by combining contrastive learning with recent advances in self-supervised learning. In particular, we use loss functions that have proven successful for visual self-supervised learning to align the image and text modalities. We find that these baselines outperform a basic implementation of CLIP. However, when a stronger training recipe is employed, the advantage disappears. Indeed, we find that a simple CLIP baseline can also be improved substantially, by up to 25% (relative) on downstream zero-shot tasks, using well-known training techniques that are popular in other subfields. Moreover, we discover that applying image and text augmentations is enough to make up for most of the improvement attained by prior works. With our improved training recipe for CLIP, we obtain state-of-the-art performance on four standard datasets and consistently outperform prior work (by up to +4% on the largest dataset), while being substantially simpler.
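For readers unfamiliar with the contrastive loss the abstract refers to, the following is a minimal NumPy sketch of a symmetric image-text contrastive objective in the style of CLIP. It is an illustration under stated assumptions, not the authors' training code: the function names, the fixed temperature of 0.07, and the use of plain arrays in place of encoder outputs are all choices made here for clarity.

```python
import numpy as np


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_emb, text_emb: arrays of shape (batch, dim); row i of each
    array is assumed to come from the same image-text pair.
    temperature: fixed here for simplicity (an assumption of this sketch).
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarities: entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.T / temperature

    # Matching pairs sit on the diagonal.
    idx = np.arange(logits.shape[0])
    loss_image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()

    # Average the two directions.
    return 0.5 * (loss_image_to_text + loss_text_to_image)
```

The loss is low when each image embedding is closest to its own caption's embedding and high when pairs are mismatched. The additional non-contrastive losses and the augmentations discussed in the abstract would be layered on top of an objective of this kind.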
