Retrieval-Enhanced Contrastive Vision-Text Models

Abstract

Contrastive image-text models such as CLIP form the building blocks of many state-of-the-art systems. While they excel at recognizing common, generic concepts, they still struggle with fine-grained entities that are rare in, or even absent from, the pre-training dataset. Hence, a key ingredient of their success has been the use of large-scale curated pre-training data, which aims to expand the set of concepts they can memorize during pre-training. In this work, we explore an alternative to encoding fine-grained knowledge directly into the model's parameters: we instead train the model to retrieve this knowledge from an external memory. Specifically, we propose to equip existing vision-text models with the ability to refine their embeddings with cross-modal information retrieved from a memory at inference time, which greatly improves their zero-shot predictions. Remarkably, we show that this can be done with a lightweight, single-layer fusion transformer on top of a frozen CLIP. Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks: for example, +10.9 on Stanford Cars, +10.2 on CUB-2011, and +7.3 on the recent OVEN benchmark.
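The retrieve-then-fuse idea can be illustrated with a small, dependency-free sketch. It is not the RECO implementation: the abstract does not specify the memory layout or the fusion details, so the code assumes a memory of paired embeddings (searched by an image-side key, returning the paired text-side value) and replaces the learned single-layer fusion transformer with one untrained attention step using identity projections. All names (`retrieve_cross_modal`, `fuse`, the toy vectors) are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def retrieve_cross_modal(query, memory_keys, memory_values, k):
    """Return the values paired with the k keys most cosine-similar to the query.

    Assumption: keys live in the query's modality and values in the other one,
    so the returned vectors carry cross-modal information.
    """
    q = normalize(query)
    scores = [dot(q, normalize(key)) for key in memory_keys]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [memory_values[i] for i in top]


def fuse(query, retrieved):
    """Refine the query embedding with the retrieved embeddings.

    Stand-in for a learned fusion layer: single-head attention with identity
    projections, a residual connection, and re-normalization.
    """
    d = len(query)
    tokens = [query] + retrieved
    logits = [dot(query, t) / math.sqrt(d) for t in tokens]
    peak = max(logits)  # subtract the max for a numerically stable softmax
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    attended = [sum(w * t[j] for w, t in zip(weights, tokens)) for j in range(d)]
    return normalize([q + a for q, a in zip(query, attended)])


# Toy memory: each image-side key is paired with a text-side value.
memory_image_keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.7, 0.7, 0.0]]
memory_text_values = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.8, 0.6]]

# Stands in for a frozen image encoder's output.
query_embedding = normalize([1.0, 0.05, 0.0])
retrieved = retrieve_cross_modal(query_embedding, memory_image_keys, memory_text_values, k=2)
refined_embedding = fuse(query_embedding, retrieved)
```

In a real system the attention projections would be trained with a contrastive objective while the encoder stays frozen, and the memory search would use an approximate nearest-neighbor index rather than a full scan.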
