
FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens

June 3, 2025
Authors: Christian Schlarmann, Francesco Croce, Nicolas Flammarion, Matthias Hein
cs.AI

Abstract

Contrastive language-image pre-training aligns the features of text-image pairs in a common latent space via distinct encoders for each modality. While this approach achieves impressive performance in several zero-shot tasks, it cannot natively handle multimodal inputs, i.e., encoding image and text into a single feature vector. As a remedy, it is common practice to use additional modules to merge the features extracted by the unimodal encoders. In this work, we present FuseLIP, an alternative architecture for multimodal embedding. Leveraging recent progress in discrete image tokenizers, we propose to use a single transformer model which operates on an extended vocabulary of text and image tokens. This early fusion approach allows the different modalities to interact at each depth of encoding and obtain richer representations compared to common late fusion. We collect new datasets for multimodal pre-training and evaluation, designing challenging tasks for multimodal encoder models. We show that FuseLIP outperforms other approaches in multimodal embedding tasks such as VQA and text-guided image transformation retrieval, while being comparable to baselines on unimodal tasks.
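To make the early-fusion idea concrete, below is a minimal sketch of a single transformer encoder operating on an extended vocabulary of text and discrete image tokens, in the spirit of the abstract. This is not the authors' implementation: the module names, dimensions, and the use of nn.TransformerEncoder are illustrative assumptions, and the random token ids stand in for the output of a real text tokenizer and a discrete image tokenizer (e.g., a VQ codebook).

```python
# Minimal sketch of early fusion of discrete tokens into a single transformer
# encoder. Illustrative assumptions throughout: sizes, layer choices, and
# pooling are NOT taken from the paper. Image token ids would normally come
# from a discrete image tokenizer; here they are random placeholders.
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    def __init__(self, text_vocab=32000, image_vocab=8192, dim=512,
                 depth=6, heads=8, max_len=256):
        super().__init__()
        # One embedding table over the extended vocabulary: text ids occupy
        # [0, text_vocab), image ids are shifted by text_vocab.
        self.image_offset = text_vocab
        self.tok_emb = nn.Embedding(text_vocab + image_vocab, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.proj = nn.Linear(dim, dim)  # projection into the shared latent space

    def forward(self, text_ids, image_ids):
        # Early fusion: concatenate both token streams so text and image
        # tokens attend to each other at every transformer layer.
        fused = torch.cat([text_ids, image_ids + self.image_offset], dim=1)
        pos = torch.arange(fused.size(1), device=fused.device)
        x = self.tok_emb(fused) + self.pos_emb(pos)
        x = self.encoder(x)
        # Mean-pool into one multimodal embedding per input.
        return nn.functional.normalize(self.proj(x.mean(dim=1)), dim=-1)

# Toy usage: a batch of 2 (text, image-token) inputs -> one vector each.
model = EarlyFusionEncoder()
text_ids = torch.randint(0, 32000, (2, 16))   # placeholder text token ids
image_ids = torch.randint(0, 8192, (2, 64))   # placeholder image token ids
emb = model(text_ids, image_ids)
print(emb.shape)  # torch.Size([2, 512])
```

The key contrast with late fusion is that there is no separate image encoder whose pooled feature is merged with a text feature afterwards; both modalities share one token space and one encoder from the first layer on.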