Factuality Matters: When Image Generation and Editing Meet Structured Visuals
October 6, 2025
Authors: Le Zhuo, Songhao Han, Yuandong Pu, Boxiang Qiu, Sayak Paul, Yue Liao, Yihao Liu, Jie Shao, Xi Chen, Si Liu, Hongsheng Li
cs.AI
Abstract
While modern visual generation models excel at creating aesthetically
pleasing natural images, they struggle with producing or editing structured
visuals like charts, diagrams, and mathematical figures, which demand
composition planning, text rendering, and multimodal reasoning for factual
fidelity. To address this, we present the first comprehensive, systematic
investigation of this domain, encompassing data construction, model training,
and an evaluation benchmark. First, we construct a large-scale dataset of 1.3
million high-quality structured image pairs derived from executable drawing
programs and augmented with chain-of-thought reasoning annotations. Building on
it, we train a unified model that integrates a VLM with FLUX.1 Kontext via a
lightweight connector for enhanced multimodal understanding. A three-stage
training curriculum enables progressive feature alignment, knowledge infusion,
and reasoning-augmented generation, further boosted by an external reasoner at
inference time. Finally, we introduce StructBench, a novel benchmark for
generation and editing with over 1,700 challenging instances, and an
accompanying evaluation metric, StructScore, which employs a multi-round Q&A
protocol to assess fine-grained factual accuracy. Evaluations of 15 models
reveal that even leading closed-source systems remain far from satisfactory.
Our model attains strong editing performance, and inference-time reasoning
yields consistent gains across diverse architectures. By releasing the dataset,
model, and benchmark, we aim to advance unified multimodal foundations for
structured visuals.
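
The abstract describes the unified model as a VLM joined to FLUX.1 Kontext "via a lightweight connector." The paper's actual connector design is not given here, so the following is only a minimal sketch of that idea, assuming a simple two-layer MLP projection from VLM token features into the diffusion backbone's conditioning space; the dimensions, normalization, and layer count are all assumptions.

```python
# Hypothetical sketch of a "lightweight connector": project VLM hidden states
# into the conditioning space of a diffusion backbone. Layer sizes and the
# two-layer MLP design are assumptions, not the paper's specification.
import torch
import torch.nn as nn

class LightweightConnector(nn.Module):
    def __init__(self, vlm_dim: int = 4096, diffusion_dim: int = 3072):
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(vlm_dim),
            nn.Linear(vlm_dim, diffusion_dim),
            nn.GELU(),
            nn.Linear(diffusion_dim, diffusion_dim),
        )

    def forward(self, vlm_hidden: torch.Tensor) -> torch.Tensor:
        # vlm_hidden: (batch, seq_len, vlm_dim) token features from the VLM;
        # the output would feed the generator's conditioning stream.
        return self.proj(vlm_hidden)

connector = LightweightConnector()
cond = connector(torch.randn(1, 77, 4096))  # -> shape (1, 77, 3072)
```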
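The three-stage curriculum is likewise only named in the abstract (feature alignment, knowledge infusion, reasoning-augmented generation). Purely for illustration, one could organize it as a staged training config; the stage names below come from the text, but which modules train at each stage and the data mix are assumptions.

```python
# Illustrative staging only: stage names are from the abstract; the
# "trainable" and "data" fields are assumptions, not the paper's recipe.
STAGES = [
    {
        "name": "feature_alignment",
        "trainable": ["connector"],  # assumed: align VLM features to the generator first
        "data": "structured image-text pairs",
    },
    {
        "name": "knowledge_infusion",
        "trainable": ["connector", "generator"],  # assumed: inject domain knowledge
        "data": "1.3M structured image pairs from executable drawing programs",
    },
    {
        "name": "reasoning_augmented_generation",
        "trainable": ["connector", "generator"],
        "data": "pairs augmented with chain-of-thought annotations",
    },
]

for stage in STAGES:
    print(f"Stage: {stage['name']} | trainable: {stage['trainable']}")
```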
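Finally, StructScore is described as a multi-round Q&A protocol for fine-grained factual accuracy. A minimal sketch of that scoring idea follows, assuming each benchmark instance ships a set of factual question-answer pairs and that some VQA-capable judge answers them against the generated image; `ask_vlm` and the exact-match aggregation are hypothetical stand-ins, not the paper's protocol.

```python
# Minimal sketch of multi-round Q&A scoring in the spirit of StructScore.
# `ask_vlm` is a hypothetical hook for whatever judge model answers
# questions about the generated image; scoring details are assumptions.
from typing import Callable, List, Tuple

def struct_score(
    image_path: str,
    qa_pairs: List[Tuple[str, str]],     # (question, expected answer)
    ask_vlm: Callable[[str, str], str],  # (image_path, question) -> answer
) -> float:
    """Fraction of factual questions answered correctly for one image."""
    correct = 0
    for question, expected in qa_pairs:
        answer = ask_vlm(image_path, question)
        correct += int(answer.strip().lower() == expected.strip().lower())
    return correct / len(qa_pairs) if qa_pairs else 0.0

# Dummy judge for demonstration; a real judge would be a VQA-capable VLM.
demo = struct_score(
    "chart.png",
    [("How many bars are in the chart?", "4")],
    ask_vlm=lambda img, q: "4",
)
print(demo)  # 1.0
```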