DOCCI: Descriptions of Connected and Contrasting Images

Abstract

Vision-language datasets are vital for both text-to-image (T2I) and image-to-text (I2T) research. However, current datasets lack descriptions with the fine-grained detail that would allow models to learn richer associations. To fill this gap, we introduce Descriptions of Connected and Contrasting Images (DOCCI), a dataset with long, human-annotated English descriptions for 15k images that were taken, curated, and donated by a single researcher intent on capturing key challenges such as spatial relations, counting, text rendering, world knowledge, and more. We instruct human annotators to create comprehensive descriptions for each image; these average 136 words in length and are crafted to clearly distinguish each image from those that are related or similar. Each description is highly compositional and typically encompasses multiple challenges. Through both quantitative and qualitative analyses, we demonstrate that DOCCI serves as an effective training resource for image-to-text generation: a PaLI 5B model fine-tuned on DOCCI shows equal or superior results compared to larger, high-performing models such as LLaVA-1.5 7B and InstructBLIP 7B. Furthermore, we show that DOCCI is a useful testbed for text-to-image generation, highlighting the limitations of current text-to-image models in capturing long descriptions and fine details.
