
Describing Differences in Image Sets with Natural Language

December 5, 2023
作者: Lisa Dunlap, Yuhui Zhang, Xiaohan Wang, Ruiqi Zhong, Trevor Darrell, Jacob Steinhardt, Joseph E. Gonzalez, Serena Yeung-Levy
cs.AI

Abstract

How do two sets of images differ? Discerning set-level differences is crucial for understanding model behaviors and analyzing datasets, yet manually sifting through thousands of images is impractical. To aid in this discovery process, we explore the task of automatically describing the differences between two sets of images, which we term Set Difference Captioning. This task takes in image sets D_A and D_B, and outputs a description that is more often true on D_A than D_B. We outline a two-stage approach that first proposes candidate difference descriptions from image sets and then re-ranks the candidates by checking how well they can differentiate the two sets. We introduce VisDiff, which first captions the images and prompts a language model to propose candidate descriptions, then re-ranks these descriptions using CLIP. To evaluate VisDiff, we collect VisDiffBench, a dataset with 187 paired image sets with ground truth difference descriptions. We apply VisDiff to various domains, such as comparing datasets (e.g., ImageNet vs. ImageNetV2), comparing classification models (e.g., zero-shot CLIP vs. supervised ResNet), summarizing model failure modes (supervised ResNet), characterizing differences between generative models (e.g., StableDiffusionV1 and V2), and discovering what makes images memorable. Using VisDiff, we are able to find interesting and previously unknown differences in datasets and models, demonstrating its utility in revealing nuanced insights.
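The second stage of the approach scores each candidate description by how well it separates the two image sets. As a rough, hypothetical sketch (not the authors' code): if CLIP gives each image a similarity score for a candidate description, the description can be ranked by the probability that a random D_A image scores higher than a random D_B image (an AUROC-style separation measure). The `sim_a`/`sim_b` lists below stand in for CLIP image-text similarities, which would come from a real model in practice.

```python
# Hypothetical sketch of the re-ranking stage of set difference captioning.
# In the real VisDiff system, per-image scores come from CLIP image-text
# similarity; here they are supplied directly as plain floats.

def auroc(pos, neg):
    """Probability that a score drawn from `pos` (set D_A) exceeds one
    drawn from `neg` (set D_B); ties count as half a win."""
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def rerank(candidates):
    """candidates: dict mapping a description to a (sim_a, sim_b) pair of
    per-image score lists. Returns descriptions sorted best-first by how
    well they separate the two sets."""
    return sorted(candidates, key=lambda d: auroc(*candidates[d]), reverse=True)

if __name__ == "__main__":
    # Toy scores: "dogs on grass" fires on D_A but not D_B; "outdoor photo"
    # fires equally on both, so it is not a *difference* description.
    candidates = {
        "dogs on grass": ([0.9, 0.8, 0.85], [0.1, 0.2, 0.15]),
        "outdoor photo": ([0.5, 0.6, 0.55], [0.55, 0.5, 0.6]),
    }
    print(rerank(candidates))  # "dogs on grass" ranks first
```

The same idea extends to any scoring backend: only `auroc` touches the scores, so swapping CLIP for another image-text model changes nothing in the ranking logic.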