Language-Image Alignment with Fixed Text Encoders

Abstract

The dominant approach to establishing language-image alignment is to pre-train text and image encoders jointly through contrastive learning, as in CLIP and its variants. In this work, we question whether such costly joint training is necessary. In particular, we investigate whether a pre-trained, fixed large language model (LLM) offers a good enough text encoder to guide visual representation learning. That is, we propose to learn Language-Image alignment with a Fixed Text encoder (LIFT) from an LLM by training only the image encoder. Somewhat surprisingly, through comprehensive benchmarking and ablation studies, we find that this much simplified framework is highly effective: LIFT outperforms CLIP in most scenarios that involve compositional understanding and long captions, while achieving considerable gains in computational efficiency. Our work takes a first step towards systematically exploring how text embeddings from LLMs can guide visual learning and suggests an alternative design choice for learning language-aligned visual representations.
