JourneyDB: A Benchmark for Generative Image Understanding
July 3, 2023
Authors: Junting Pan, Keqiang Sun, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, Jifeng Dai, Yu Qiao, Hongsheng Li
cs.AI
Abstract
While recent advances in vision-language models have revolutionized multi-modal understanding, it remains unclear whether they are capable of comprehending generated images. Compared to real data, synthetic images exhibit a higher degree of diversity in both content and style, which makes them significantly more difficult for models to fully apprehend. To this end, we present JourneyDB, a large-scale dataset for multi-modal visual understanding of generated images. Our curated dataset contains 4 million diverse, high-quality generated images, each paired with the text prompt used to produce it. We further design four benchmarks to quantify generated-image understanding in terms of both content and style interpretation: prompt inversion, style retrieval, image captioning, and visual question answering. Finally, we assess the performance of current state-of-the-art multi-modal models on JourneyDB and provide an in-depth analysis of their strengths and limitations in understanding generated content. We hope the proposed dataset and benchmarks will facilitate research in the field of generative content understanding. The dataset will be available at https://journeydb.github.io.
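To make the prompt-inversion benchmark more concrete, below is a minimal, hypothetical Python sketch: given a generated image, a model under test predicts the prompt that produced it, and the prediction is scored against the image with a CLIP-style similarity. The annotation file name, field names, and the use of CLIP similarity here are illustrative assumptions, not the official JourneyDB data format or evaluation protocol; see https://journeydb.github.io for the actual release.

```python
# Hypothetical sketch of scoring a prompt-inversion prediction with CLIP similarity.
# The file layout ("journeydb_train.json" with {"image": ..., "prompt": ...} records)
# is an assumption for illustration, not the official JourneyDB format.

import json
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def prompt_similarity(image_path: str, predicted_prompt: str) -> float:
    """Cosine similarity between an image and a predicted prompt in CLIP embedding space."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[predicted_prompt], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return float((image_emb @ text_emb.T).item())

# Assumed annotation file: a list of {"image": ..., "prompt": ...} records.
with open("journeydb_train.json") as f:
    records = json.load(f)

for rec in records[:3]:
    # In a real evaluation, the prompt would come from the model under test;
    # scoring the ground-truth prompt here serves only as an upper-bound sanity check.
    score = prompt_similarity(rec["image"], rec["prompt"])
    print(rec["image"], round(score, 3))
```

A similarity-based score is only one plausible choice; text-overlap metrics between the predicted and ground-truth prompts would be another natural option for this task.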