CoLLM: A Large Language Model for Composed Image Retrieval

Abstract

Composed Image Retrieval (CIR) is a complex task that aims to retrieve images based on a multimodal query. Typical training data consists of triplets containing a reference image, a textual description of desired modifications, and the target image, which are expensive and time-consuming to acquire. The scarcity of CIR datasets has led to zero-shot approaches utilizing synthetic triplets or leveraging vision-language models (VLMs) with ubiquitous web-crawled image-caption pairs. However, these methods have significant limitations: synthetic triplets suffer from limited scale, lack of diversity, and unnatural modification text, while image-caption pairs hinder joint embedding learning of the multimodal query due to the absence of triplet data. Moreover, existing approaches struggle with complex and nuanced modification texts that demand sophisticated fusion and understanding of vision and language modalities. We present CoLLM, a one-stop framework that effectively addresses these limitations. Our approach generates triplets on-the-fly from image-caption pairs, enabling supervised training without manual annotation. We leverage Large Language Models (LLMs) to generate joint embeddings of reference images and modification texts, facilitating deeper multimodal fusion. Additionally, we introduce Multi-Text CIR (MTCIR), a large-scale dataset comprising 3.4M samples, and refine existing CIR benchmarks (CIRR and Fashion-IQ) to enhance evaluation reliability. Experimental results demonstrate that CoLLM achieves state-of-the-art performance across multiple CIR benchmarks and settings. MTCIR yields competitive results, with up to 15% performance improvement. Our refined benchmarks provide more reliable evaluation metrics for CIR models, contributing to the advancement of this important field.
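
To make the task setup concrete, the following is a minimal sketch of a CIR triplet and of ranking a gallery against a query embedding. It is not CoLLM's implementation: the names `Triplet`, `cosine`, and `rank_gallery` are hypothetical, the embeddings are hand-written toy vectors standing in for whatever a real model would produce, and cosine similarity is assumed as the ranking score because it is a common choice in embedding-based retrieval.

```python
import math
from dataclasses import dataclass


@dataclass
class Triplet:
    """One CIR training example, as described in the abstract (field names are hypothetical)."""
    reference_image: str    # identifier or path of the reference image
    modification_text: str  # text describing the desired change
    target_image: str       # identifier or path of the image to retrieve


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_gallery(query_embedding, gallery):
    """Return gallery IDs sorted from most to least similar to the query."""
    return sorted(
        gallery,
        key=lambda image_id: cosine(query_embedding, gallery[image_id]),
        reverse=True,
    )


# Toy example: in a real system the query embedding would come from a model
# that fuses the reference image with the modification text (assumption).
example = Triplet("dress_red.jpg", "make it blue with short sleeves", "a")
gallery = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
query_embedding = [0.9, 0.1]

ranking = rank_gallery(query_embedding, gallery)
print(ranking)  # ['a', 'c', 'b']
```

In this toy gallery the target image "a" is ranked first, which is the outcome a retrieval model is trained to produce for each triplet.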
