Describing Differences in Image Sets with Natural Language

Abstract

How do two sets of images differ? Discerning set-level differences is crucial for understanding model behaviors and analyzing datasets, yet manually sifting through thousands of images is impractical. To aid in this discovery process, we explore the task of automatically describing the differences between two sets of images, which we term Set Difference Captioning. This task takes in image sets D_A and D_B and outputs a description that is more often true on D_A than on D_B. We outline a two-stage approach that first proposes candidate difference descriptions from the image sets and then re-ranks the candidates by checking how well they differentiate the two sets. We introduce VisDiff, which first captions the images and prompts a language model to propose candidate descriptions, then re-ranks these descriptions using CLIP. To evaluate VisDiff, we collect VisDiffBench, a dataset of 187 paired image sets with ground-truth difference descriptions. We apply VisDiff to various domains, such as comparing datasets (e.g., ImageNet vs. ImageNetV2), comparing classification models (e.g., zero-shot CLIP vs. supervised ResNet), summarizing model failure modes (supervised ResNet), characterizing differences between generative models (e.g., StableDiffusionV1 and V2), and discovering what makes images memorable. Using VisDiff, we are able to find interesting and previously unknown differences in datasets and models, demonstrating its utility in revealing nuanced insights.
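The re-ranking stage can be illustrated with a minimal sketch. The abstract does not state which scoring rule is used, so the code below assumes one plausible choice: for each candidate description, treat its per-image similarity scores as a classifier output and measure how well they separate D_A from D_B with AUROC. The function names (`auroc`, `rerank`), the candidate descriptions, and all score values are hypothetical; in a real pipeline the scores would come from an image-text model such as CLIP.

```python
from typing import Dict, List, Tuple


def auroc(scores_a: List[float], scores_b: List[float]) -> float:
    """Probability that a random D_A score exceeds a random D_B score.

    Ties count as half a win. A value near 1.0 means the description
    matches D_A images much more often than D_B images; 0.5 means it
    does not separate the sets at all.
    """
    if not scores_a or not scores_b:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for a in scores_a:
        for b in scores_b:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(scores_a) * len(scores_b))


def rerank(
    candidates: Dict[str, Tuple[List[float], List[float]]]
) -> List[Tuple[str, float]]:
    """Sort candidate descriptions by how well they separate D_A from D_B."""
    ranked = [(desc, auroc(a, b)) for desc, (a, b) in candidates.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical similarity scores (made-up numbers, not real CLIP output):
# description -> (scores on D_A images, scores on D_B images)
candidates = {
    "outdoor scene": ([0.20, 0.22, 0.19], [0.21, 0.20, 0.23]),
    "people posing for a photo": ([0.31, 0.28, 0.35], [0.12, 0.30, 0.10]),
}

ranking = rerank(candidates)
for description, score in ranking:
    print(f"{score:.3f}  {description}")
```

Under this assumption, a description that is about equally likely in both sets lands near 0.5 and falls to the bottom of the ranking, while one that is "more often true on D_A" rises to the top. The pairwise loop is quadratic in set size; for thousands of images a rank-based AUROC would be the more practical choice.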
