FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark

Abstract

The advancement of open-source text-to-image (T2I) models has been hindered by the absence of large-scale, reasoning-focused datasets and comprehensive evaluation benchmarks, resulting in a performance gap compared to leading closed-source systems. To address this challenge, we introduce FLUX-Reason-6M and PRISM-Bench (Precise and Robust Image Synthesis Measurement Benchmark). FLUX-Reason-6M is a massive dataset consisting of 6 million high-quality FLUX-generated images and 20 million bilingual (English and Chinese) descriptions specifically designed to teach complex reasoning. The images are organized according to six key characteristics: Imagination, Entity, Text rendering, Style, Affection, and Composition, and an explicit Generation Chain-of-Thought (GCoT) is designed to provide detailed breakdowns of the image generation steps. The entire data curation process takes 15,000 A100 GPU days, providing the community with a resource previously unattainable outside of large industrial labs. PRISM-Bench offers a novel evaluation standard with seven distinct tracks, including a formidable Long Text challenge using GCoT. Through carefully designed prompts, it utilizes advanced vision-language models for nuanced, human-aligned assessment of prompt-image alignment and image aesthetics. Our extensive evaluation of 19 leading models on PRISM-Bench reveals critical performance gaps and highlights specific areas requiring improvement. Our dataset, benchmark, and evaluation code are released to catalyze the next wave of reasoning-oriented T2I generation. Project page: https://flux-reason-6m.github.io/