FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark
September 11, 2025
Authors: Rongyao Fang, Aldrich Yu, Chengqi Duan, Linjiang Huang, Shuai Bai, Yuxuan Cai, Kun Wang, Si Liu, Xihui Liu, Hongsheng Li
cs.AI
Abstract
The advancement of open-source text-to-image (T2I) models has been hindered
by the absence of large-scale, reasoning-focused datasets and comprehensive
evaluation benchmarks, resulting in a performance gap compared to leading
closed-source systems. To address this challenge, we introduce FLUX-Reason-6M
and PRISM-Bench (Precise and Robust Image Synthesis Measurement Benchmark).
FLUX-Reason-6M is a massive dataset consisting of 6 million high-quality
FLUX-generated images and 20 million bilingual (English and Chinese)
descriptions specifically designed to teach complex reasoning. The images are
organized according to six key characteristics: Imagination, Entity, Text
rendering, Style, Affection, and Composition. We also design an explicit
Generation Chain-of-Thought (GCoT) to provide detailed breakdowns of image
generation steps. The full data curation process took 15,000 A100 GPU days,
providing the
community with a resource previously unattainable outside of large industrial
labs. PRISM-Bench offers a novel evaluation standard with seven distinct
tracks, including a formidable Long Text challenge using GCoT. Through
carefully designed prompts, it utilizes advanced vision-language models for
nuanced human-aligned assessment of prompt-image alignment and image
aesthetics. Our extensive evaluation of 19 leading models on PRISM-Bench
reveals critical performance gaps and highlights specific areas requiring
improvement. Our dataset, benchmark, and evaluation code are released to
catalyze the next wave of reasoning-oriented T2I generation. Project page:
https://flux-reason-6m.github.io/
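
As a rough illustration of the organization described above, the sketch below models one dataset sample tagged with the six characteristics and carrying bilingual captions plus a GCoT string. The class name `ReasonSample`, its field names, the multi-label tagging, and the helper `filter_by_characteristic` are all hypothetical; the abstract does not specify the released schema, so treat this as one possible layout rather than the actual format.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Characteristic(Enum):
    # The six characteristics named in the abstract.
    IMAGINATION = "imagination"
    ENTITY = "entity"
    TEXT_RENDERING = "text_rendering"
    STYLE = "style"
    AFFECTION = "affection"
    COMPOSITION = "composition"


@dataclass
class ReasonSample:
    # Hypothetical record layout; field names are illustrative only
    # and are not taken from the released dataset.
    image_path: str
    characteristics: set[Characteristic]
    # Maps a language code ("en", "zh") to a caption (assumed layout).
    captions: dict[str, str] = field(default_factory=dict)
    # Step-by-step Generation Chain-of-Thought text (assumed to be per sample).
    gcot: str = ""


def filter_by_characteristic(samples, wanted):
    """Return the samples tagged with the given characteristic."""
    return [s for s in samples if wanted in s.characteristics]


# Two made-up samples, used only to exercise the helper.
samples = [
    ReasonSample(
        image_path="img_0001.png",
        characteristics={Characteristic.TEXT_RENDERING, Characteristic.COMPOSITION},
        captions={"en": "A neon sign reading OPEN above a diner door."},
        gcot="Place the door at the center, then render the sign text above it.",
    ),
    ReasonSample(
        image_path="img_0002.png",
        characteristics={Characteristic.STYLE},
        captions={"en": "A watercolor painting of a harbor at dawn."},
    ),
]

text_samples = filter_by_characteristic(samples, Characteristic.TEXT_RENDERING)
print([s.image_path for s in text_samples])
```

Using a set of tags per sample is a design choice made here so that one image can belong to several characteristics at once; a single-label layout would work equally well if the released data assigns one category per image.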