JourneyDB: A Benchmark for Generative Image Understanding
July 3, 2023
Authors: Junting Pan, Keqiang Sun, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, Jifeng Dai, Yu Qiao, Hongsheng Li
cs.AI
Abstract
While recent advances in vision-language models have revolutionized multi-modal understanding, it remains unclear whether they are capable of comprehending generated images. Compared to real data, synthetic images exhibit a higher degree of diversity in both content and style, which makes them significantly more difficult for models to fully apprehend. To this end, we present JourneyDB, a large-scale dataset for multi-modal visual understanding of generated images. Our curated dataset contains 4 million diverse, high-quality generated images, each paired with the text prompt used to produce it. We further design four benchmarks to quantify generated-image understanding in terms of both content and style interpretation: prompt inversion, style retrieval, image captioning, and visual question answering. Finally, we assess the performance of current state-of-the-art multi-modal models on JourneyDB and provide an in-depth analysis of their strengths and limitations in understanding generated content. We hope the proposed dataset and benchmarks will facilitate research in the field of generative content understanding. The dataset will be available at https://journeydb.github.io.
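To make the prompt-inversion benchmark more concrete, below is a minimal, hypothetical Python sketch: given a generated image, a model under test predicts the prompt that produced it, and the prediction is scored against the image with a CLIP-style similarity. The annotation file name, field names, and the use of CLIP similarity here are illustrative assumptions, not the official JourneyDB data format or evaluation protocol; see https://journeydb.github.io for the actual release.

```python
# Hypothetical sketch of scoring a prompt-inversion prediction with CLIP similarity.
# The file layout ("journeydb_train.json" with {"image": ..., "prompt": ...} records)
# is an assumption for illustration, not the official JourneyDB format.

import json
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def prompt_similarity(image_path: str, predicted_prompt: str) -> float:
    """Cosine similarity between an image and a predicted prompt in CLIP embedding space."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[predicted_prompt], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return float((image_emb @ text_emb.T).item())

# Assumed annotation file: a list of {"image": ..., "prompt": ...} records.
with open("journeydb_train.json") as f:
    records = json.load(f)

for rec in records[:3]:
    # In a real evaluation, the prompt would come from the model under test;
    # scoring the ground-truth prompt here serves only as an upper-bound sanity check.
    score = prompt_similarity(rec["image"], rec["prompt"])
    print(rec["image"], round(score, 3))
```

A similarity-based score is only one plausible choice; text-overlap metrics between the predicted and ground-truth prompts would be another natural option for this task.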