InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models

Abstract

General SVG modeling remains challenging due to fragmented datasets, limited transferability of methods across tasks, and the difficulty of handling structural complexity. In response, we leverage the strong transfer and generalization capabilities of multimodal large language models (MLLMs) to achieve unified modeling for SVG understanding, editing, and generation. We present the InternSVG family, an integrated data-benchmark-model suite. At its core is SAgoge, the largest and most comprehensive multimodal dataset for SVG tasks, encompassing both static graphics and dynamic animations. It covers icons, long-sequence illustrations, scientific diagrams, and dynamic animations, supports tasks of varied difficulty levels, and provides deeper hierarchies with richer attributes than previous datasets. Based on this resource, we introduce SArena, a companion benchmark with comprehensive task definitions and standardized evaluation that aligns with the domains and difficulty spectrum covered by SAgoge. Building on these foundations, we propose InternSVG, a unified MLLM for SVG understanding, editing, and generation with SVG-specific special tokens, subword-based embedding initialization, and a two-stage training strategy that progresses from short static SVGs to long-sequence illustrations and complex animations. This unified formulation induces positive transfer and improves overall performance. Experiments on SArena and prior benchmarks confirm that InternSVG achieves substantial gains and consistently outperforms leading open and proprietary counterparts.
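The abstract names subword-based embedding initialization for the SVG-specific special tokens but does not spell out the procedure. The following is a minimal sketch of one common variant, assuming each new token's embedding is set to the mean of the embeddings of the subwords that the original tokenizer would have produced for it. The toy vocabulary, the `toy_tokenize` helper, and the function names are hypothetical and chosen only for illustration; they are not taken from the InternSVG implementation.

```python
from typing import Callable, Dict, List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def init_special_token_embeddings(
    embeddings: Dict[str, Vector],
    special_tokens: List[str],
    tokenize: Callable[[str], List[str]],
) -> Dict[str, Vector]:
    """Return a copy of `embeddings` extended with one row per special token.

    Assumption: each new row is the mean of the embeddings of the subwords
    that the original tokenizer produces for the token's surface string.
    """
    extended = dict(embeddings)
    for token in special_tokens:
        pieces = [p for p in tokenize(token) if p in embeddings]
        if not pieces:
            raise KeyError(f"no known subwords for {token!r}")
        extended[token] = mean_vector([embeddings[p] for p in pieces])
    return extended


# Toy vocabulary (hypothetical values, for illustration only).
toy_embeddings: Dict[str, Vector] = {
    "<": [1.0, 0.0],
    "path": [0.0, 2.0],
    "circle": [4.0, 4.0],
    ">": [2.0, 1.0],
}


def toy_tokenize(text: str) -> List[str]:
    """Greedy longest-match split against the toy vocabulary."""
    pieces: List[str] = []
    vocab = sorted(toy_embeddings, key=len, reverse=True)
    i = 0
    while i < len(text):
        match = next((v for v in vocab if text.startswith(v, i)), None)
        if match is None:
            i += 1  # skip characters the toy vocabulary cannot cover
            continue
        pieces.append(match)
        i += len(match)
    return pieces


extended = init_special_token_embeddings(
    toy_embeddings, ["<path>", "<circle>"], toy_tokenize
)
print(extended["<path>"])  # [1.0, 1.0]
```

Starting a new token from the average of its subwords places it near semantically related existing tokens, which is generally expected to give a better starting point than random initialization. Whether InternSVG uses a plain mean or a different aggregation is not stated in the abstract.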
