JourneyDB: A Benchmark for Generative Image Understanding
July 3, 2023
Authors: Junting Pan, Keqiang Sun, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, Jifeng Dai, Yu Qiao, Hongsheng Li
cs.AI
Abstract
While recent advancements in vision-language models have revolutionized
multi-modal understanding, it remains unclear whether they possess the
capability to comprehend generated images. Compared to real data,
synthetic images exhibit a higher degree of diversity in both content and
style, which makes them significantly more difficult for models to fully
apprehend. To this end, we present JourneyDB, a large-scale dataset for
multi-modal visual understanding of generated images. Our curated dataset
covers 4 million diverse and high-quality generated images paired with the text
prompts used to produce them. We further design four benchmarks to quantify
the performance of generated image understanding in terms of both content and
style interpretation: prompt inversion, style retrieval, image captioning, and
visual question answering. Lastly, we assess the
performance of current state-of-the-art multi-modal models when applied to
JourneyDB, and provide an in-depth analysis of their strengths and limitations
in generated content understanding. We hope the proposed dataset and benchmarks
will facilitate research in the field of generative content understanding.
The dataset will be available at https://journeydb.github.io.
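Below is a minimal, illustrative sketch of how a single JourneyDB sample and a toy prompt-inversion score could be represented in code. The field names (image_path, prompt, style_tags, caption, vqa) and the token-overlap metric are assumptions made for illustration, inferred from the abstract's description of images paired with prompts and the four benchmarks; they are not the dataset's actual schema or official evaluation protocol.

```python
# Hypothetical sketch of a JourneyDB-style sample record and a toy
# prompt-inversion score. Field names and the metric are illustrative
# assumptions, not the released dataset's actual format.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class JourneyDBSample:
    image_path: str                                        # generated image
    prompt: str                                            # prompt used to produce it (prompt-inversion target)
    style_tags: List[str] = field(default_factory=list)    # style-retrieval labels
    caption: str = ""                                      # captioning reference
    vqa: Dict[str, str] = field(default_factory=dict)      # question -> answer pairs


def prompt_token_overlap(predicted: str, reference: str) -> float:
    """Toy prompt-inversion score: fraction of reference prompt tokens
    recovered by the predicted prompt. A real benchmark would use stronger
    metrics (e.g. BLEU or embedding-based similarity)."""
    ref_tokens = set(reference.lower().split())
    pred_tokens = set(predicted.lower().split())
    return len(ref_tokens & pred_tokens) / max(len(ref_tokens), 1)


if __name__ == "__main__":
    sample = JourneyDBSample(
        image_path="images/000001.jpg",
        prompt="a watercolor painting of a fox in a snowy forest",
        style_tags=["watercolor", "soft lighting"],
        caption="A fox standing in a snow-covered forest.",
        vqa={"What animal is shown?": "a fox"},
    )
    print(prompt_token_overlap("a fox in a snowy forest, watercolor", sample.prompt))
```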