FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark
September 11, 2025
Authors: Rongyao Fang, Aldrich Yu, Chengqi Duan, Linjiang Huang, Shuai Bai, Yuxuan Cai, Kun Wang, Si Liu, Xihui Liu, Hongsheng Li
cs.AI
Abstract
The advancement of open-source text-to-image (T2I) models has been hindered
by the absence of large-scale, reasoning-focused datasets and comprehensive
evaluation benchmarks, resulting in a performance gap compared to leading
closed-source systems. To address this challenge, we introduce FLUX-Reason-6M
and PRISM-Bench (Precise and Robust Image Synthesis Measurement Benchmark).
FLUX-Reason-6M is a massive dataset consisting of 6 million high-quality
FLUX-generated images and 20 million bilingual (English and Chinese)
descriptions specifically designed to teach complex reasoning. The images are
organized according to six key characteristics: Imagination, Entity, Text
rendering, Style, Affection, and Composition, and we design an explicit
Generation Chain-of-Thought (GCoT) to provide detailed breakdowns of image
generation steps. The full data curation took 15,000 A100 GPU days, providing the
community with a resource previously unattainable outside of large industrial
labs. PRISM-Bench offers a novel evaluation standard with seven distinct
tracks, including a formidable Long Text challenge using GCoT. Through
carefully designed prompts, it utilizes advanced vision-language models for
nuanced human-aligned assessment of prompt-image alignment and image
aesthetics. Our extensive evaluation of 19 leading models on PRISM-Bench
reveals critical performance gaps and highlights specific areas requiring
improvement. Our dataset, benchmark, and evaluation code are released to
catalyze the next wave of reasoning-oriented T2I generation. Project page:
https://flux-reason-6m.github.io/.