VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models

Abstract

Scalable Vector Graphics (SVG) is an essential format for technical illustration and digital design, offering precise, resolution-independent rendering and flexible semantic editability. In practice, however, original vector source files are frequently lost or inaccessible, leaving only "flat" rasterized versions (e.g., PNG or JPEG) that are difficult to modify or scale. Manually reconstructing these figures is prohibitively labor-intensive and requires specialized expertise to recover the original geometric intent. To bridge this gap, we propose VFIG, a family of Vision-Language Models trained for complex, high-fidelity figure-to-SVG conversion. While this task is inherently data-driven, existing datasets are typically small-scale and lack the complexity of professional diagrams. We address this by introducing VFIG-DATA, a large-scale dataset of 66K high-quality figure-SVG pairs, curated from a diverse mix of real-world paper figures and procedurally generated diagrams. Recognizing that SVGs are composed of recurring primitives and hierarchical local structures, we introduce a coarse-to-fine training curriculum that begins with supervised fine-tuning (SFT) to learn atomic primitives and then transitions to reinforcement learning (RL) refinement to optimize global diagram fidelity, layout consistency, and topological edge cases. Finally, we introduce VFIG-BENCH, a comprehensive evaluation suite with novel metrics designed to measure the structural integrity of complex figures. VFIG achieves state-of-the-art performance among open-source models and performs on par with GPT-5.2, reaching a VLM-Judge score of 0.829 on VFIG-BENCH.
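
To make the remark about recurring primitives and hierarchical local structures concrete, the following is a minimal sketch in Python, not part of VFIG itself. It parses a small hand-written SVG (a hypothetical two-box diagram, not drawn from VFIG-DATA) with the standard library and reports how often each primitive occurs and how deeply elements are nested. The `PRIMITIVES` set and the `summarize` helper are illustrative assumptions, not names from the paper.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical hand-written diagram; not taken from VFIG-DATA.
SVG_TEXT = """<svg xmlns="http://www.w3.org/2000/svg" width="220" height="80">
  <g id="node-a">
    <rect x="10" y="20" width="80" height="40" fill="none" stroke="black"/>
    <text x="50" y="45" text-anchor="middle">Encoder</text>
  </g>
  <g id="node-b">
    <rect x="130" y="20" width="80" height="40" fill="none" stroke="black"/>
    <text x="170" y="45" text-anchor="middle">Decoder</text>
  </g>
  <line x1="90" y1="40" x2="130" y2="40" stroke="black"/>
</svg>"""

# Assumed working definition of "primitive": SVG basic shapes, paths, and text.
PRIMITIVES = {"rect", "circle", "ellipse", "line",
              "polyline", "polygon", "path", "text"}


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def summarize(svg_text):
    """Return (primitive counts, maximum nesting depth) for an SVG string."""
    root = ET.fromstring(svg_text)
    counts = Counter()
    max_depth = 0
    stack = [(root, 0)]  # iterative depth-first walk; the root is depth 0
    while stack:
        elem, depth = stack.pop()
        max_depth = max(max_depth, depth)
        name = local_name(elem.tag)
        if name in PRIMITIVES:
            counts[name] += 1
        stack.extend((child, depth + 1) for child in elem)
    return counts, max_depth


counts, depth = summarize(SVG_TEXT)
print(dict(counts), depth)
```

Even in this toy diagram the same `rect` plus `text` pair repeats inside separate `<g>` groups, which is the kind of local regularity the abstract refers to; a raster PNG of the same figure would carry none of this structure explicitly.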
