Language-Image Alignment with Fixed Text Encoders

Abstract

The dominant approach to establishing language-image alignment is to pre-train text and image encoders jointly through contrastive learning, as in CLIP and its variants. In this work, we question whether such costly joint training is necessary. In particular, we investigate whether a pre-trained, fixed large language model (LLM) offers a good enough text encoder to guide visual representation learning. To this end, we propose to learn Language-Image alignment with a Fixed Text encoder (LIFT) from an LLM by training only the image encoder. Somewhat surprisingly, through comprehensive benchmarking and ablation studies, we find that this much-simplified framework is highly effective: it outperforms CLIP in most scenarios that involve compositional understanding and long captions, while achieving considerable gains in computational efficiency. Our work takes a first step towards systematically exploring how text embeddings from LLMs can guide visual learning and suggests an alternative design choice for learning language-aligned visual representations.
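To make the fixed-text-encoder idea concrete, here is a minimal sketch of an alignment loss computed against precomputed caption embeddings. It assumes a CLIP-style image-to-text contrastive (InfoNCE) objective with a temperature of 0.07; the abstract does not specify the exact loss or its hyperparameters. The names `contrastive_loss`, `l2_normalize`, and `FIXED_TEXT_EMBS`, and the toy 2-D vectors, are hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Mean image-to-text InfoNCE loss over a batch of paired embeddings.

    Row i of image_embs is assumed to be paired with row i of text_embs.
    The temperature value is an illustrative assumption, not from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        # Similarity of image i to every caption in the batch.
        logits = [sum(a * b for a, b in zip(img, t)) / temperature for t in txts]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability of the matching caption.
        total += log_denom - logits[i]
    return total / len(imgs)


# Hypothetical caption embeddings, computed once by a frozen LLM-based
# text encoder and then reused unchanged for every training step.
FIXED_TEXT_EMBS = [[1.0, 0.0], [0.0, 1.0]]

# Image-encoder outputs that match their captions give a low loss ...
aligned = contrastive_loss([[0.9, 0.1], [0.2, 0.8]], FIXED_TEXT_EMBS)
# ... while mismatched outputs give a high loss.
swapped = contrastive_loss([[0.2, 0.8], [0.9, 0.1]], FIXED_TEXT_EMBS)
```

In a setup like this, gradients would flow only into the image encoder, because the text side is a constant lookup. That is one plausible reading of where the efficiency gains mentioned in the abstract come from: the text encoder is never run or updated during training.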
