LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation

November 7, 2024
Authors: Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Liang Hu, Qi Dai, Xiyang Dai, Dongdong Chen, Chong Luo, Lili Qiu
cs.AI

Abstract

CLIP is one of the most important multimodal foundational models today. What powers CLIP's capabilities? The rich supervision signals provided by natural language, the carrier of human knowledge, shape a powerful cross-modal representation space. However, with the rapid advancements in large language models (LLMs) like GPT-4 and LLaMA, the boundaries of language comprehension and generation are continually being pushed. This raises an intriguing question: can the capabilities of LLMs be harnessed to further improve multimodal representation learning? The potential benefits of incorporating LLMs into CLIP are clear. LLMs' strong textual understanding can fundamentally improve CLIP's ability to handle image captions, drastically enhancing its ability to process long and complex texts, a well-known limitation of vanilla CLIP. Moreover, LLMs are trained on a vast corpus of text, possessing open-world knowledge. This allows them to expand on caption information during training, increasing the efficiency of the learning process. In this paper, we propose LLM2CLIP, a novel approach that embraces the power of LLMs to unlock CLIP's potential. By fine-tuning the LLM in the caption space with contrastive learning, we extract its textual capabilities into the output embeddings, significantly improving the output layer's textual discriminability. We then design an efficient training process where the fine-tuned LLM acts as a powerful teacher for CLIP's visual encoder. Thanks to the LLM's presence, we can now incorporate longer and more complex captions without being restricted by vanilla CLIP's text encoder's context window and ability limitations. Our experiments demonstrate that this approach brings substantial improvements in cross-modal tasks.
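
As a rough illustration of the recipe the abstract describes, below is a minimal PyTorch-style sketch of the two stages: contrastive fine-tuning of the LLM in the caption space, then training CLIP's vision encoder against the frozen, fine-tuned LLM as teacher. The module and function names (LLMCaptionEncoder, stage1_step, stage2_step), the mean-pooling of LLM hidden states, the use of paired caption variants as positives in stage 1, and the temperature value are all illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def symmetric_contrastive_loss(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style symmetric InfoNCE loss over a batch of paired embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2


class LLMCaptionEncoder(nn.Module):
    """Wraps an LLM backbone (placeholder) and pools its hidden states into one caption embedding."""

    def __init__(self, llm: nn.Module, hidden_dim: int, embed_dim: int):
        super().__init__()
        self.llm = llm                          # assumed to return hidden states of shape [B, T, H]
        self.proj = nn.Linear(hidden_dim, embed_dim)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.llm(input_ids, attention_mask)                          # [B, T, H]
        mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)  # masked mean pooling
        return self.proj(pooled)


# Stage 1: contrastive fine-tuning of the LLM in the caption space.
# Two caption variants of the same image are treated as a positive pair (an assumption),
# pushing the LLM's output embeddings to become discriminative across captions.
def stage1_step(text_encoder, optimizer, caption_a, caption_b):
    emb_a = text_encoder(*caption_a)            # (input_ids, attention_mask) for variant A
    emb_b = text_encoder(*caption_b)            # (input_ids, attention_mask) for variant B
    loss = symmetric_contrastive_loss(emb_a, emb_b)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Stage 2: the fine-tuned LLM, kept frozen, acts as the teacher text encoder
# while CLIP's vision encoder is trained against its caption embeddings.
def stage2_step(vision_encoder, text_encoder, optimizer, images, captions):
    with torch.no_grad():                       # frozen teacher
        text_emb = text_encoder(*captions)
    image_emb = vision_encoder(images)          # assumed to output [B, embed_dim]
    loss = symmetric_contrastive_loss(image_emb, text_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this reading, the LLM is treated as a fixed teacher in stage 2, so only the vision encoder and its projection receive gradients, which is what allows arbitrarily long and complex captions to drive the visual representation without the cost of back-propagating through the language model.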
