VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models

Abstract

Scalable Vector Graphics (SVG) is an essential format for technical illustration and digital design, offering resolution independence and flexible semantic editability. In practice, however, the original vector source files are frequently lost or inaccessible, leaving only "flat" rasterized versions (e.g., PNG or JPEG) that are difficult to modify or scale. Manually reconstructing these figures is labor-intensive and requires specialized expertise to recover the original geometric intent. To bridge this gap, we propose VFIG, a family of Vision-Language Models trained for complex, high-fidelity figure-to-SVG conversion. Although this task is inherently data-driven, existing datasets are typically small and lack the complexity of professional diagrams. We address this by introducing VFIG-DATA, a large-scale dataset of 66K high-quality figure-SVG pairs curated from a diverse mix of real-world paper figures and procedurally generated diagrams. Recognizing that SVGs are composed of recurring primitives and hierarchical local structures, we introduce a coarse-to-fine training curriculum: supervised fine-tuning (SFT) first teaches atomic primitives, and reinforcement learning (RL) refinement then optimizes global diagram fidelity, layout consistency, and the handling of topological edge cases. Finally, we introduce VFIG-BENCH, a comprehensive evaluation suite with novel metrics designed to measure the structural integrity of complex figures. VFIG achieves state-of-the-art performance among open-source models and performs on par with GPT-5.2, with a VLM-Judge score of 0.829 on VFIG-BENCH.
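To make the phrase "recurring primitives and hierarchical local structures" concrete, the following is a minimal illustrative sketch of a tiny two-box diagram assembled from reusable SVG primitives (`rect`, `text`, `line`) nested in `g` groups. It is not taken from VFIG or VFIG-DATA; the helper names `make_box` and `build_diagram`, the layout, and all coordinates are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def make_box(x, y, label):
    """Hypothetical helper: a <g> group holding a rectangle and a centered label."""
    group = ET.Element("g", {"class": "node"})
    ET.SubElement(group, "rect", {
        "x": str(x), "y": str(y), "width": "80", "height": "40",
        "fill": "none", "stroke": "black",
    })
    text = ET.SubElement(group, "text", {
        "x": str(x + 40), "y": str(y + 25), "text-anchor": "middle",
    })
    text.text = label
    return group


def build_diagram():
    """Hypothetical two-node diagram: the same box primitive reused twice, plus a connector."""
    svg = ET.Element("svg", {
        "xmlns": SVG_NS, "width": "240", "height": "60", "viewBox": "0 0 240 60",
    })
    svg.append(make_box(10, 10, "Input"))    # first reuse of the box primitive
    svg.append(make_box(150, 10, "Output"))  # second reuse, shifted to the right
    ET.SubElement(svg, "line", {
        "x1": "90", "y1": "30", "x2": "150", "y2": "30", "stroke": "black",
    })
    return svg


svg_root = build_diagram()
svg_text = ET.tostring(svg_root, encoding="unicode")
print(svg_text)
```

Each element in the output stays individually editable (a label or a box position can be changed in isolation), which is the property a flat PNG or JPEG of the same diagram loses.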
