Retrieval-Enhanced Contrastive Vision-Text Models

Abstract

Contrastive image-text models such as CLIP form the building blocks of many state-of-the-art systems. While they excel at recognizing common generic concepts, they still struggle with fine-grained entities that are rare in, or even absent from, the pre-training dataset. Hence, a key ingredient of their success has been the use of large-scale curated pre-training data, which aims to expand the set of concepts that they can memorize during the pre-training stage. In this work, we explore an alternative to encoding fine-grained knowledge directly into the model's parameters: we instead train the model to retrieve this knowledge from an external memory. Specifically, we propose to equip existing vision-text models with the ability to refine their embeddings with cross-modal information retrieved from a memory at inference time, which greatly improves their zero-shot predictions. Remarkably, we show that this can be done with a lightweight, single-layer fusion transformer on top of a frozen CLIP. Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks: for example, +10.9 on Stanford Cars, +10.2 on CUB-2011, and +7.3 on the recent OVEN benchmark.
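The retrieve-then-refine idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, and everything in it is an assumption made for illustration: the memory is taken to hold paired image and text embeddings, retrieval is a cosine nearest-neighbor search on the image side that returns the paired text embeddings, and the "fusion" step is a single untrained attention head with a residual connection standing in for the single-layer fusion transformer. The function names, dimensions, and random weights are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def retrieve_cross_modal(query, memory_keys, memory_values, k):
    """Return the k memory values whose keys are most similar to the query.

    Keys and values come from different modalities (assumed here: image
    embeddings as keys, their paired text embeddings as values).
    """
    scores = memory_keys @ query          # cosine similarity for unit-norm inputs
    top_k = np.argsort(-scores)[:k]       # indices of the k highest scores
    return memory_values[top_k]


def fuse(query, retrieved, w_q, w_k, w_v):
    """Single-head attention of the query over [query; retrieved], plus a residual."""
    tokens = np.vstack([query[None, :], retrieved])   # shape (k + 1, d)
    q = query @ w_q
    keys = tokens @ w_k
    values = tokens @ w_v
    logits = keys @ q / np.sqrt(q.shape[0])
    weights = np.exp(logits - logits.max())           # numerically stable softmax
    weights /= weights.sum()
    return l2_normalize(query + weights @ values)


# Stand-ins for frozen CLIP embeddings and an external memory (random, hypothetical).
rng = np.random.default_rng(0)
d, n, k = 16, 100, 4
memory_image_emb = l2_normalize(rng.normal(size=(n, d)))
memory_text_emb = l2_normalize(rng.normal(size=(n, d)))
# Untrained fusion weights; in a trained system only these would be learned.
w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

query_image_emb = l2_normalize(rng.normal(size=d))
neighbors = retrieve_cross_modal(query_image_emb, memory_image_emb, memory_text_emb, k)
refined = fuse(query_image_emb, neighbors, w_q, w_k, w_v)
```

In this sketch the refined embedding keeps the dimensionality of the original one, so it could be compared against class-name text embeddings for zero-shot prediction in the same way as an unmodified CLIP embedding.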
