DOCCI: Descriptions of Connected and Contrasting Images
April 30, 2024
Authors: Yasumasa Onoe, Sunayana Rane, Zachary Berger, Yonatan Bitton, Jaemin Cho, Roopal Garg, Alexander Ku, Zarana Parekh, Jordi Pont-Tuset, Garrett Tanzer, Su Wang, Jason Baldridge
cs.AI
Abstract
Vision-language datasets are vital for both text-to-image (T2I) and
image-to-text (I2T) research. However, current datasets lack descriptions with
fine-grained detail that would allow for richer associations to be learned by
models. To fill the gap, we introduce Descriptions of Connected and Contrasting
Images (DOCCI), a dataset with long, human-annotated English descriptions for
15k images that were taken, curated and donated by a single researcher intent
on capturing key challenges such as spatial relations, counting, text
rendering, world knowledge, and more. We instruct human annotators to create
comprehensive descriptions for each image; these average 136 words in length
and are crafted to clearly distinguish each image from those that are related
or similar. Each description is highly compositional and typically encompasses
multiple challenges. Through both quantitative and qualitative analyses, we
demonstrate that DOCCI serves as an effective training resource for
image-to-text generation -- a PaLI 5B model finetuned on DOCCI shows equal or
superior results compared to highly performant larger models like LLaVA-1.5 7B
and InstructBLIP 7B. Furthermore, we show that DOCCI is a useful testbed for
text-to-image generation, highlighting the limitations of current text-to-image
models in capturing long descriptions and fine details.