Factuality Matters: When Image Generation and Editing Meet Structured Visuals

Abstract

Modern visual generation models excel at creating aesthetically pleasing natural images, but they struggle to produce or edit structured visuals such as charts, diagrams, and mathematical figures, which demand composition planning, text rendering, and multimodal reasoning for factual fidelity. To address this, we present the first comprehensive, systematic investigation of this domain, encompassing data construction, model training, and an evaluation benchmark. First, we construct a large-scale dataset of 1.3 million high-quality structured image pairs derived from executable drawing programs and augmented with chain-of-thought reasoning annotations. Building on this dataset, we train a unified model that integrates a VLM with FLUX.1 Kontext via a lightweight connector for enhanced multimodal understanding. A three-stage training curriculum enables progressive feature alignment, knowledge infusion, and reasoning-augmented generation, and an external reasoner further improves results at inference time. Finally, we introduce StructBench, a novel benchmark for generation and editing with over 1,700 challenging instances, and an accompanying evaluation metric, StructScore, which employs a multi-round Q&A protocol to assess fine-grained factual accuracy. Evaluations of 15 models reveal that even leading closed-source systems remain far from satisfactory. Our model attains strong editing performance, and inference-time reasoning yields consistent gains across diverse architectures. By releasing the dataset, model, and benchmark, we aim to advance unified multimodal foundations for structured visuals.
