Factuality Matters: When Image Generation and Editing Meet Structured Visuals

Abstract

While modern visual generation models excel at creating aesthetically pleasing natural images, they struggle to produce or edit structured visuals such as charts, diagrams, and mathematical figures, which demand composition planning, text rendering, and multimodal reasoning for factual fidelity. To address this, we present the first comprehensive, systematic investigation of this domain, encompassing data construction, model training, and an evaluation benchmark. First, we construct a large-scale dataset of 1.3 million high-quality structured image pairs derived from executable drawing programs and augmented with chain-of-thought reasoning annotations. Building on this dataset, we train a unified model that integrates a VLM with FLUX.1 Kontext via a lightweight connector for enhanced multimodal understanding. A three-stage training curriculum enables progressive feature alignment, knowledge infusion, and reasoning-augmented generation, and an external reasoner further boosts performance at inference time. Finally, we introduce StructBench, a novel benchmark for generation and editing with over 1,700 challenging instances, and an accompanying evaluation metric, StructScore, which employs a multi-round Q&A protocol to assess fine-grained factual accuracy. Evaluations of 15 models reveal that even leading closed-source systems remain far from satisfactory. Our model attains strong editing performance, and inference-time reasoning yields consistent gains across diverse architectures. By releasing the dataset, model, and benchmark, we aim to advance unified multimodal foundations for structured visuals.
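To make the idea of Q&A-based factual scoring concrete, here is a minimal sketch. It is not the StructScore implementation: the abstract does not say how questions are generated, how answers are compared, or how rounds are aggregated. The names `qa_accuracy`, `normalize`, and `answer_fn` are hypothetical, and the exact-match comparison and the stub answerer (standing in for a vision-language model that inspects a generated image) are assumptions made only for illustration.

```python
from typing import Callable, Sequence, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count as errors."""
    return " ".join(text.lower().split())


def qa_accuracy(
    qa_pairs: Sequence[Tuple[str, str]],
    answer_fn: Callable[[str], str],
) -> float:
    """Return the fraction of questions whose predicted answer matches the reference.

    qa_pairs:  (question, reference_answer) pairs about one image (hypothetical format).
    answer_fn: hypothetical callable that answers a question about the image.
    """
    if not qa_pairs:
        raise ValueError("qa_pairs must not be empty")
    correct = sum(
        normalize(answer_fn(question)) == normalize(reference)
        for question, reference in qa_pairs
    )
    return correct / len(qa_pairs)


# Usage with a stub answerer; a real setup would query a model on the generated image.
pairs = [
    ("What is the bar height for 2021?", "42"),
    ("What is the x-axis label?", "Year"),
    ("How many bars are shown?", "3"),
]
stub_answers = {
    "What is the bar height for 2021?": "42",
    "What is the x-axis label?": " year ",
    "How many bars are shown?": "4",
}
score = qa_accuracy(pairs, lambda q: stub_answers[q])
print(f"{score:.2f}")  # prints 0.67
```

Scoring per question rather than per image is one way to obtain a fine-grained signal: a chart with one wrong label still earns credit for the facts it renders correctly.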
