LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation

Abstract

CLIP is one of the most important multimodal foundation models today. What powers CLIP's capabilities? The rich supervision signals provided by natural language, the carrier of human knowledge, shape a powerful cross-modal representation space. Meanwhile, rapid advances in large language models (LLMs) such as GPT-4 and LLaMA are continually pushing the boundaries of language comprehension and generation. This raises an intriguing question: can the capabilities of LLMs be harnessed to further improve multimodal representation learning? The potential benefits of incorporating LLMs into CLIP are clear. An LLM's strong textual understanding can fundamentally improve CLIP's handling of image captions, and in particular its ability to process long and complex texts, a well-known limitation of vanilla CLIP. Moreover, LLMs are trained on vast text corpora and possess open-world knowledge. This allows them to expand on caption information during training, increasing the efficiency of the learning process. In this paper, we propose LLM2CLIP, a novel approach that embraces the power of LLMs to unlock CLIP's potential. By fine-tuning the LLM in the caption space with contrastive learning, we extract its textual capabilities into the output embeddings, significantly improving the textual discriminability of the output layer. We then design an efficient training process in which the fine-tuned LLM acts as a powerful teacher for CLIP's visual encoder. Thanks to the LLM's presence, we can incorporate longer and more complex captions without being restricted by the context window and limited capability of vanilla CLIP's text encoder. Our experiments demonstrate that this approach brings substantial improvements in cross-modal tasks.
