

FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens

June 3, 2025
Authors: Christian Schlarmann, Francesco Croce, Nicolas Flammarion, Matthias Hein
cs.AI

Abstract

Contrastive language-image pre-training aligns the features of text-image pairs in a common latent space via distinct encoders for each modality. While this approach achieves impressive performance on several zero-shot tasks, it cannot natively handle multimodal inputs, i.e., encode an image and a text into a single feature vector. As a remedy, it is common practice to use additional modules to merge the features extracted by the unimodal encoders. In this work, we present FuseLIP, an alternative architecture for multimodal embedding. Leveraging recent progress in discrete image tokenizers, we propose to use a single transformer model that operates on an extended vocabulary of text and image tokens. This early fusion approach allows the different modalities to interact at every depth of the encoder, yielding richer representations than common late fusion. We collect new datasets for multimodal pre-training and evaluation, designing challenging tasks for multimodal encoder models. We show that FuseLIP outperforms other approaches on multimodal embedding tasks such as VQA and text-guided image transformation retrieval, while being comparable to baselines on unimodal tasks.