Language-Image Alignment with Fixed Text Encoders

Abstract

The dominant approach to establishing language-image alignment is to pre-train text and image encoders jointly through contrastive learning, as in CLIP and its variants. In this work, we question whether such costly joint training is necessary. In particular, we investigate whether a pre-trained, fixed large language model (LLM) offers a good enough text encoder to guide visual representation learning. That is, we propose to learn Language-Image alignment with a Fixed Text encoder (LIFT) from an LLM by training only the image encoder. Somewhat surprisingly, through comprehensive benchmarking and ablation studies, we find that this much simplified framework is highly effective: LIFT outperforms CLIP in most scenarios that involve compositional understanding and long captions, while achieving considerable gains in computational efficiency. Our work takes a first step towards systematically exploring how text embeddings from LLMs can guide visual learning and suggests an alternative design choice for learning language-aligned visual representations.

