
A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions

December 14, 2023
Authors: Jack Urbanek, Florian Bordes, Pietro Astolfi, Mary Williamson, Vasu Sharma, Adriana Romero-Soriano
cs.AI

Abstract

Curation methods for massive vision-language datasets trade off between dataset size and quality. However, even the highest-quality curated captions available are far too short to capture the rich visual detail in an image. To show the value of dense and highly aligned image-text pairs, we collect the Densely Captioned Images (DCI) dataset, containing 8012 natural images human-annotated with mask-aligned descriptions averaging above 1000 words each. With precise and reliable captions associated with specific parts of an image, we can evaluate vision-language models' (VLMs) understanding of image content with a novel task that matches each caption with its corresponding subcrop. As current models are often limited to 77 text tokens, we also introduce a summarized version (sDCI) in which each caption length is limited. We show that modern techniques that make progress on standard benchmarks do not correspond with significant improvements on our sDCI-based benchmark. Lastly, we finetune CLIP using sDCI and show significant improvements over the baseline despite a small training set. By releasing the first human-annotated dense image captioning dataset, we hope to enable the development of new benchmarks and fine-tuning recipes for the next generation of VLMs.
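
To make the evaluation concrete, below is a minimal sketch of the subcrop-caption matching task the abstract describes, using an off-the-shelf CLIP checkpoint from Hugging Face. This is not the authors' released benchmark code: the checkpoint choice, file names, and captions are placeholder assumptions, and only the task structure (score every subcrop against every caption, count diagonal argmax hits) follows the paper's description. The explicit 77-token truncation is the limit that motivates the summarized sDCI captions.

```python
# Hypothetical sketch of caption-to-subcrop matching with CLIP.
# Given N subcrops of one image and their N short captions, the model
# should assign each caption to its own subcrop.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder subcrop files and captions; in the DCI setting these come
# from one image's mask-aligned annotations.
subcrops = [Image.open(p) for p in ("crop_0.png", "crop_1.png", "crop_2.png")]
captions = [
    "a rusted blue mailbox with peeling paint",
    "a tabby cat asleep on a sunlit windowsill",
    "a row of bicycles chained to a metal rack",
]

# CLIP's text encoder accepts at most 77 tokens, hence the summarized
# sDCI captions; anything longer is simply truncated here.
text_in = processor.tokenizer(captions, padding=True, truncation=True,
                              max_length=77, return_tensors="pt")
image_in = processor.image_processor(images=subcrops, return_tensors="pt")

with torch.no_grad():
    out = model(input_ids=text_in.input_ids,
                attention_mask=text_in.attention_mask,
                pixel_values=image_in.pixel_values)

# logits_per_image[i, j] scores subcrop i against caption j; the correct
# assignment lies on the diagonal, so accuracy is the argmax hit rate.
pred = out.logits_per_image.argmax(dim=-1)
acc = (pred == torch.arange(len(subcrops))).float().mean().item()
print(f"subcrop-caption matching accuracy: {acc:.2f}")
```

Because all subcrops come from the same image and often share context, this matching task is considerably harder than standard whole-image retrieval, which is why progress on standard benchmarks need not transfer to it.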