

A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions

December 14, 2023
Authors: Jack Urbanek, Florian Bordes, Pietro Astolfi, Mary Williamson, Vasu Sharma, Adriana Romero-Soriano
cs.AI

Abstract

Curation methods for massive vision-language datasets trade off between dataset size and quality. However, even the highest-quality available curated captions are far too short to capture the rich visual detail in an image. To show the value of dense and highly aligned image-text pairs, we collect the Densely Captioned Images (DCI) dataset, containing 8012 natural images human-annotated with mask-aligned descriptions averaging above 1000 words each. With precise and reliable captions associated with specific parts of an image, we can evaluate vision-language models' (VLMs) understanding of image content with a novel task that matches each caption with its corresponding subcrop. As current models are often limited to 77 text tokens, we also introduce a summarized version (sDCI) in which each caption length is limited. We show that modern techniques that make progress on standard benchmarks do not correspond to significant improvement on our sDCI-based benchmark. Lastly, we finetune CLIP using sDCI and show significant improvements over the baseline despite a small training set. By releasing the first human-annotated dense image captioning dataset, we hope to enable the development of new benchmarks or fine-tuning recipes for the next generation of VLMs to come.
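To make the caption-to-subcrop matching task concrete, below is a minimal sketch of how one might score an off-the-shelf CLIP model on it: embed every subcrop and every region caption, then check whether each crop's nearest caption is its true annotation. The record format (a list of bounding-box/caption pairs per image), the helper name, and the Hugging Face checkpoint are illustrative assumptions, not the paper's released data schema or evaluation code; note that the CLIP tokenizer's 77-token limit forces truncation of long captions, which is the motivation for sDCI.

```python
# Hedged sketch of the subcrop-matching evaluation, assuming a generic
# list of ((left, upper, right, lower), caption) pairs per image. The
# data format here is an assumption, not the released DCI schema.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def subcrop_matching_accuracy(image: Image.Image, regions) -> float:
    """regions: list of ((left, upper, right, lower), caption) pairs.
    Scores every caption against every subcrop of the image and reports
    how often the true (diagonal) pairing gets the highest similarity."""
    crops = [image.crop(box) for box, _ in regions]
    captions = [caption for _, caption in regions]
    inputs = processor(
        text=captions, images=crops, return_tensors="pt",
        padding=True, truncation=True,  # CLIP truncates text at 77 tokens
    )
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image: [n_crops, n_captions] similarity matrix
    predictions = outputs.logits_per_image.argmax(dim=-1)
    targets = torch.arange(len(regions))
    return (predictions == targets).float().mean().item()
```

A random-guess model would score 1/n on an image with n annotated regions, so accuracy on this matrix directly measures how well the model ties fine-grained text to the specific image region it describes.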