JourneyDB: A Benchmark for Generative Image Understanding

July 3, 2023
Authors: Junting Pan, Keqiang Sun, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, Jifeng Dai, Yu Qiao, Hongsheng Li
cs.AI

Abstract

While recent advances in vision-language models have revolutionized multi-modal understanding, it remains unclear whether these models can comprehend generated images. Compared to real data, synthetic images exhibit a higher degree of diversity in both content and style, which makes them significantly harder for models to fully comprehend. To this end, we present JourneyDB, a large-scale dataset for multi-modal visual understanding of generated images. Our curated dataset covers 4 million diverse, high-quality generated images paired with the text prompts used to produce them. We further design four benchmarks to quantify generated image understanding in terms of both content and style interpretation: prompt inversion, style retrieval, image captioning, and visual question answering. Lastly, we assess the performance of current state-of-the-art multi-modal models on JourneyDB and provide an in-depth analysis of their strengths and limitations in understanding generated content. We hope the proposed dataset and benchmarks will facilitate research in the field of generative content understanding. The dataset will be available at https://journeydb.github.io.