Improving fine-grained understanding in image-text pre-training

Abstract

We introduce SPARse Fine-grained Contrastive Alignment (SPARC), a simple method for pretraining more fine-grained multimodal representations from image-text pairs. Given that multiple image patches often correspond to single words, we propose to learn a grouping of image patches for every token in the caption. To achieve this, we use a sparse similarity metric between image patches and language tokens and compute for each token a language-grouped vision embedding as the weighted average of patches. The token and language-grouped vision embeddings are then contrasted through a fine-grained sequence-wise loss that only depends on individual samples and does not require other batch samples as negatives. This enables more detailed information to be learned in a computationally inexpensive manner. SPARC combines this fine-grained loss with a contrastive loss between global image and text embeddings to learn representations that simultaneously encode global and local information. We thoroughly evaluate our proposed method and show improved performance over competing approaches both on image-level tasks relying on coarse-grained information, e.g. classification, as well as region-level tasks relying on fine-grained information, e.g. retrieval, object detection, and segmentation. Moreover, SPARC improves model faithfulness and captioning in foundational vision-language models.
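The fine-grained part of the method can be illustrated with a minimal NumPy sketch for a single image-caption pair. The abstract does not specify the exact sparsification rule, the normalization, or the temperature, so the choices below are assumptions made for illustration: per-token min-max normalization of the similarities, a sparsity threshold of 1/P for P patches, and a temperature of 0.07. The helper names (`l2_normalize`, `log_softmax`, `fine_grained_loss`) are hypothetical. The global image-text contrastive loss that SPARC adds on top is not shown.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit Euclidean norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(x, axis):
    """Numerically stable log-softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def fine_grained_loss(patches, tokens, temperature=0.07, eps=1e-8):
    """Sequence-wise loss for one image-caption pair.

    patches: (P, d) image patch embeddings
    tokens:  (L, d) caption token embeddings
    Returns (loss, weights, grouped) with weights (L, P) and grouped (L, d).
    """
    num_patches = patches.shape[0]

    # Token-to-patch similarities, one row per caption token.
    sim = tokens @ patches.T

    # Assumption: min-max normalize each token's similarities to [0, 1].
    lo = sim.min(axis=1, keepdims=True)
    hi = sim.max(axis=1, keepdims=True)
    sim = (sim - lo) / (hi - lo + eps)

    # Assumption: sparsify by zeroing entries below 1 / P.
    sim = np.where(sim < 1.0 / num_patches, 0.0, sim)

    # Alignment weights: each token's surviving similarities sum to ~1.
    weights = sim / (sim.sum(axis=1, keepdims=True) + eps)

    # Language-grouped vision embedding: weighted average of patches per token.
    grouped = weights @ patches

    # Contrast tokens with grouped embeddings inside this sample only;
    # the other tokens of the same caption act as negatives.
    logits = l2_normalize(tokens) @ l2_normalize(grouped).T / temperature
    token_to_vision = -np.mean(np.diag(log_softmax(logits, axis=1)))
    vision_to_token = -np.mean(np.diag(log_softmax(logits, axis=0)))
    return 0.5 * (token_to_vision + vision_to_token), weights, grouped


# Usage with random embeddings: 16 patches, 5 tokens, dimension 8.
rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 8))
tokens = rng.normal(size=(5, 8))
loss, weights, grouped = fine_grained_loss(patches, tokens)
print(weights.shape, grouped.shape, float(loss))
```

Because the logits matrix is L by L for a single sample, the loss needs no negatives from other batch elements, which is the property the abstract credits for the low computational cost.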
