Factuality Matters: When Image Generation and Editing Meet Structured Visuals
October 6, 2025
Authors: Le Zhuo, Songhao Han, Yuandong Pu, Boxiang Qiu, Sayak Paul, Yue Liao, Yihao Liu, Jie Shao, Xi Chen, Si Liu, Hongsheng Li
cs.AI
Abstract
While modern visual generation models excel at creating aesthetically
pleasing natural images, they struggle with producing or editing structured
visuals like charts, diagrams, and mathematical figures, which demand
composition planning, text rendering, and multimodal reasoning for factual
fidelity. To address this, we present the first comprehensive, systematic
investigation of this domain, encompassing data construction, model training,
and an evaluation benchmark. First, we construct a large-scale dataset of 1.3
million high-quality structured image pairs derived from executable drawing
programs and augmented with chain-of-thought reasoning annotations. Building on
it, we train a unified model that integrates a VLM with FLUX.1 Kontext via a
lightweight connector for enhanced multimodal understanding. A three-stage
training curriculum enables progressive feature alignment, knowledge infusion,
and reasoning-augmented generation, further boosted by an external reasoner at
inference time. Finally, we introduce StructBench, a novel benchmark for
generation and editing with over 1,700 challenging instances, and an
accompanying evaluation metric, StructScore, which employs a multi-round Q&A
protocol to assess fine-grained factual accuracy. Evaluations of 15 models
reveal that even leading closed-source systems remain far from satisfactory.
Our model attains strong editing performance, and inference-time reasoning
yields consistent gains across diverse architectures. By releasing the dataset,
model, and benchmark, we aim to advance unified multimodal foundations for
structured visuals.
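
The abstract describes the unified model as a VLM joined to FLUX.1 Kontext "via a lightweight connector." The paper's actual connector design is not given here, so the following is only a minimal sketch of that idea, assuming a simple two-layer MLP projection from VLM token features into the diffusion backbone's conditioning space; the dimensions, normalization, and layer count are all assumptions.

```python
# Hypothetical sketch of a "lightweight connector": project VLM hidden states
# into the conditioning space of a diffusion backbone. Layer sizes and the
# two-layer MLP design are assumptions, not the paper's specification.
import torch
import torch.nn as nn

class LightweightConnector(nn.Module):
    def __init__(self, vlm_dim: int = 4096, diffusion_dim: int = 3072):
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(vlm_dim),
            nn.Linear(vlm_dim, diffusion_dim),
            nn.GELU(),
            nn.Linear(diffusion_dim, diffusion_dim),
        )

    def forward(self, vlm_hidden: torch.Tensor) -> torch.Tensor:
        # vlm_hidden: (batch, seq_len, vlm_dim) token features from the VLM;
        # the output would feed the generator's conditioning stream.
        return self.proj(vlm_hidden)

connector = LightweightConnector()
cond = connector(torch.randn(1, 77, 4096))  # -> shape (1, 77, 3072)
```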
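The three-stage curriculum is likewise only named in the abstract (feature alignment, knowledge infusion, reasoning-augmented generation). Purely for illustration, one could organize it as a staged training config; the stage names below come from the text, but which modules train at each stage and the data mix are assumptions.

```python
# Illustrative staging only: stage names are from the abstract; the
# "trainable" and "data" fields are assumptions, not the paper's recipe.
STAGES = [
    {
        "name": "feature_alignment",
        "trainable": ["connector"],  # assumed: align VLM features to the generator first
        "data": "structured image-text pairs",
    },
    {
        "name": "knowledge_infusion",
        "trainable": ["connector", "generator"],  # assumed: inject domain knowledge
        "data": "1.3M structured image pairs from executable drawing programs",
    },
    {
        "name": "reasoning_augmented_generation",
        "trainable": ["connector", "generator"],
        "data": "pairs augmented with chain-of-thought annotations",
    },
]

for stage in STAGES:
    print(f"Stage: {stage['name']} | trainable: {stage['trainable']}")
```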
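Finally, StructScore is described as a multi-round Q&A protocol for fine-grained factual accuracy. A minimal sketch of that scoring idea follows, assuming each benchmark instance ships a set of factual question-answer pairs and that some VQA-capable judge answers them against the generated image; `ask_vlm` and the exact-match aggregation are hypothetical stand-ins, not the paper's protocol.

```python
# Minimal sketch of multi-round Q&A scoring in the spirit of StructScore.
# `ask_vlm` is a hypothetical hook for whatever judge model answers
# questions about the generated image; scoring details are assumptions.
from typing import Callable, List, Tuple

def struct_score(
    image_path: str,
    qa_pairs: List[Tuple[str, str]],     # (question, expected answer)
    ask_vlm: Callable[[str, str], str],  # (image_path, question) -> answer
) -> float:
    """Fraction of factual questions answered correctly for one image."""
    correct = 0
    for question, expected in qa_pairs:
        answer = ask_vlm(image_path, question)
        correct += int(answer.strip().lower() == expected.strip().lower())
    return correct / len(qa_pairs) if qa_pairs else 0.0

# Dummy judge for demonstration; a real judge would be a VQA-capable VLM.
demo = struct_score(
    "chart.png",
    [("How many bars are in the chart?", "4")],
    ask_vlm=lambda img, q: "4",
)
print(demo)  # 1.0
```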