FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens

Abstract

Contrastive language-image pre-training aligns the features of text-image pairs in a common latent space using a separate encoder for each modality. While this approach achieves impressive performance on several zero-shot tasks, it cannot natively handle multimodal inputs, i.e., encode an image and a text into a single feature vector. As a remedy, it is common practice to use additional modules that merge the features extracted by the unimodal encoders. In this work, we present FuseLIP, an alternative architecture for multimodal embedding. Leveraging recent progress in discrete image tokenizers, we propose a single transformer model that operates on an extended vocabulary of text and image tokens. This early fusion approach lets the modalities interact at every depth of the encoder, yielding richer representations than common late fusion. We collect new datasets for multimodal pre-training and evaluation, and design challenging tasks for multimodal encoder models. We show that FuseLIP outperforms other approaches on multimodal embedding tasks such as VQA and text-guided image transformation retrieval, while remaining comparable to baselines on unimodal tasks.
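The idea of an extended vocabulary shared by text and image tokens can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `fuse_tokens`, the vocabulary sizes, and the choice to place image tokens before text tokens are all assumptions made for illustration. The sketch only shows how discrete image token ids could be shifted past the text vocabulary so that a single transformer embedding table covers both modalities.

```python
from typing import List


def fuse_tokens(
    text_ids: List[int],
    image_ids: List[int],
    text_vocab_size: int,
    image_vocab_size: int,
) -> List[int]:
    """Map text and image token ids into one extended vocabulary.

    Hypothetical helper: text ids keep their values in
    [0, text_vocab_size); image ids are shifted into
    [text_vocab_size, text_vocab_size + image_vocab_size).
    """
    for t in text_ids:
        if not 0 <= t < text_vocab_size:
            raise ValueError(f"text id {t} is outside the text vocabulary")
    for i in image_ids:
        if not 0 <= i < image_vocab_size:
            raise ValueError(f"image id {i} is outside the image vocabulary")

    # Shift image ids so they never collide with text ids.
    shifted_image_ids = [text_vocab_size + i for i in image_ids]

    # Assumed ordering: image tokens first, then text tokens.
    return shifted_image_ids + list(text_ids)


# Toy example: a text vocabulary of 10 ids and an image codebook of 8 ids.
fused = fuse_tokens(
    text_ids=[1, 2],
    image_ids=[0, 3],
    text_vocab_size=10,
    image_vocab_size=8,
)
print(fused)  # [10, 13, 1, 2]
```

Because the fused sequence is processed by one model, every layer can attend across both modalities. A late-fusion design would instead combine two separately computed feature vectors only at the end.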
