VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models

Abstract

Scalable Vector Graphics (SVG) is an essential format for technical illustration and digital design, offering precise resolution independence and flexible semantic editability. In practice, however, original vector source files are frequently lost or inaccessible, leaving only "flat" rasterized versions (e.g., PNG or JPEG) that are difficult to modify or scale. Manually reconstructing these figures is prohibitively labor-intensive and requires specialized expertise to recover the original geometric intent. To bridge this gap, we propose VFIG, a family of Vision-Language Models trained for complex, high-fidelity figure-to-SVG conversion. While this task is inherently data-driven, existing datasets are typically small-scale and lack the complexity of professional diagrams. We address this by introducing VFIG-DATA, a large-scale dataset of 66K high-quality figure-SVG pairs, curated from a diverse mix of real-world paper figures and procedurally generated diagrams. Recognizing that SVGs are composed of recurring primitives and hierarchical local structures, we introduce a coarse-to-fine training curriculum that begins with supervised fine-tuning (SFT) to learn atomic primitives and then transitions to reinforcement learning (RL) refinement to optimize global diagram fidelity, layout consistency, and topological edge cases. Finally, we introduce VFIG-BENCH, a comprehensive evaluation suite with novel metrics designed to measure the structural integrity of complex figures. VFIG achieves state-of-the-art performance among open-source models and performs on par with GPT-5.2, achieving a VLM-Judge score of 0.829 on VFIG-BENCH.
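
To make the phrase "recurring primitives and hierarchical local structures" concrete, the following is a minimal illustrative sketch and not part of VFIG itself. The toy SVG string and the helper names `local_name`, `primitive_counts`, and `max_depth` are assumptions introduced here for illustration; only Python's standard `xml.etree.ElementTree` and `collections.Counter` are used. It counts how often each element type recurs in a small diagram and measures how deeply the elements are nested.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical toy figure (not from VFIG-DATA): two labeled boxes and a connector.
SVG = """<svg xmlns="http://www.w3.org/2000/svg" width="220" height="60">
  <g id="node-a">
    <rect x="10" y="10" width="80" height="40" fill="none" stroke="black"/>
    <text x="50" y="35" text-anchor="middle">A</text>
  </g>
  <g id="node-b">
    <rect x="130" y="10" width="80" height="40" fill="none" stroke="black"/>
    <text x="170" y="35" text-anchor="middle">B</text>
  </g>
  <line x1="90" y1="30" x2="130" y2="30" stroke="black"/>
</svg>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def primitive_counts(svg_text):
    """Count every element type in the document, including the root <svg>."""
    root = ET.fromstring(svg_text)
    return Counter(local_name(el.tag) for el in root.iter())


def max_depth(element):
    """Nesting depth of the element tree; a leaf element has depth 1."""
    return 1 + max((max_depth(child) for child in element), default=0)


counts = primitive_counts(SVG)
depth = max_depth(ET.fromstring(SVG))
print(dict(counts), depth)
```

In this toy case the same `rect` plus `text` pattern repeats inside each `g` group, which is the kind of local regularity the abstract refers to. Real paper figures are far larger, but they can be inspected the same way.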
