VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

Abstract

Code has emerged as a precise and executable medium for reasoning and action in the agent era. Yet progress has largely focused on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored. Inspired by how humans reason over sketches, we advocate SVG code as a compact, interpretable, and executable visual representation. We introduce VCode, a benchmark that reframes multimodal understanding as code generation: given an image, a model must produce SVG that preserves symbolic meaning for downstream reasoning. VCode covers three domains: general commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel evaluation protocol in which a policy model answers questions over rendered SVGs; correct answers indicate faithful symbolic preservation. Empirically, frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap between language-centric and visual-centric coding. To close this gap, we introduce VCoder, an agentic framework that augments VLMs along two axes: (i) Thinking with Revision, which iteratively analyzes discrepancies and refines SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply structured cues such as objects, shapes, and text beyond the model's intrinsic capacity. Across benchmarks, frontier VLMs with strong reasoning capabilities score well overall yet remain limited in professional knowledge and 3D reasoning. VCoder delivers a 12.3-point overall gain over the top-performing Claude-4-Opus. Human studies show that both humans and VLMs perform worse on rendered SVGs, yet their consistency reveals the promise of symbolic visual representation. The benchmark and code are available at https://github.com/CSU-JPG/VCode.
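
The CodeVQA protocol and the Thinking with Revision loop described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`Example`, `codevqa_accuracy`, `refine`, and the callable types) is hypothetical, the models and renderer are passed in as plain functions, and case-insensitive exact match is an assumed stand-in for whatever answer-scoring rule the benchmark actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical type aliases, chosen only for this illustration.
Image = bytes                                  # raw image bytes
SvgCoder = Callable[[Image], str]              # image -> SVG source
Renderer = Callable[[str], Image]              # SVG source -> rendered image
PolicyModel = Callable[[Image, str], str]      # (image, question) -> answer
Reviser = Callable[[Image, Image, str], str]   # (target, rendered, svg) -> revised svg


@dataclass
class Example:
    image: Image
    question: str
    answer: str


def refine(image: Image, svg: str, render: Renderer,
           revise: Reviser, rounds: int = 2) -> str:
    """Iteratively revise SVG by comparing its rendering with the target image."""
    for _ in range(rounds):
        svg = revise(image, render(svg), svg)
    return svg


def codevqa_accuracy(examples: Iterable[Example], coder: SvgCoder,
                     render: Renderer, policy: PolicyModel) -> float:
    """Fraction of questions answered correctly from rendered SVGs only."""
    examples = list(examples)
    if not examples:
        return 0.0
    correct = 0
    for ex in examples:
        svg = coder(ex.image)                       # image -> SVG code
        rendered = render(svg)                      # SVG code -> image
        prediction = policy(rendered, ex.question)  # policy never sees ex.image
        # Assumption: case-insensitive exact match as the scoring rule.
        correct += prediction.strip().lower() == ex.answer.strip().lower()
    return correct / len(examples)
```

In this sketch the policy model receives only the rendered SVG, so a correct answer can come only from information that the generated code preserved. A revision step could be plugged in by wrapping the coder, for example `lambda img: refine(img, coder(img), render, revise)`.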
