TextAtlas5M: A Large-Scale Dataset for Dense Text Image Generation

Abstract

Text-conditioned image generation has gained significant attention in recent years, and models are processing increasingly long and comprehensive text prompts. In everyday life, dense and intricate text appears in contexts like advertisements, infographics, and signage, where the integration of text and visuals is essential for conveying complex information. However, despite these advances, generating images that contain long-form text remains a persistent challenge, largely because existing datasets often focus on shorter and simpler text. To address this gap, we introduce TextAtlas5M, a novel dataset specifically designed to evaluate long-text rendering in text-conditioned image generation. Our dataset consists of 5 million generated and collected long-text images across diverse data types, enabling comprehensive evaluation of large-scale generative models on long-text image generation. We further curate TextAtlasEval, a human-improved test set of 3,000 samples across 3 data domains, establishing one of the most extensive benchmarks for text-conditioned generation. Evaluations suggest that the TextAtlasEval benchmark presents significant challenges even for the most advanced proprietary models (e.g., GPT4o with DallE-3), while their open-source counterparts show an even larger performance gap. This evidence positions TextAtlas5M as a valuable dataset for training and evaluating future-generation text-conditioned image generation models.