VCode: A Multimodal Coding Benchmark with SVG as Symbolic Visual Representation
November 4, 2025
Authors: Kevin Qinghong Lin, Yuhao Zheng, Hangyu Ran, Dantong Zhu, Dongxing Mao, Linjie Li, Philip Torr, Alex Jinpeng Wang
cs.AI
Abstract
Code has emerged as a precise and executable medium for reasoning and action
in the agent era. Yet, progress has largely focused on language-centric tasks
such as program synthesis and debugging, leaving visual-centric coding
underexplored. Inspired by how humans reason over sketches, we advocate SVG
code as a compact, interpretable, and executable visual representation. We
introduce VCode, a benchmark that reframes multimodal understanding as code
generation: given an image, a model must produce SVG that preserves symbolic
meaning for downstream reasoning. VCode covers three domains - general
commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric
perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel
evaluation protocol in which a policy model answers questions over rendered
SVGs; correct answers indicate faithful symbolic preservation. Empirically,
frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap
between language-centric and visual-centric coding. To close this gap, we
introduce VCoder, an agentic framework that augments VLMs along two axes: (i)
Thinking with Revision, which iteratively analyzes discrepancies and refines
SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply
structured cues such as objects, shapes, and text beyond the model's intrinsic
capacity. Across benchmarks, frontier VLMs with strong reasoning capabilities
score well overall yet remain limited in professional knowledge and 3D
reasoning. VCoder delivers a 12.3-point overall gain over the top-performing
Claude-4-Opus. Human studies show that both humans and VLMs perform worse on
rendered SVGs, yet their consistency reveals the promise of symbolic visual
representation. The benchmark and code are available at
https://github.com/CSU-JPG/VCode.
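
The CodeVQA protocol described above can be sketched as follows. This is an illustrative stand-in, not the benchmark's actual API: `render_svg`, `policy_model_answer`, and `codevqa_score` are hypothetical names, and both rasterization and the policy VLM are stubbed for the demo. The idea is simply that symbolic fidelity is measured as QA accuracy over the rendered SVG.

```python
# Hedged sketch of CodeVQA: render the generated SVG, have a policy model
# answer questions over the rendering, and score fidelity as QA accuracy.
# All function names below are illustrative assumptions.

def render_svg(svg_code: str) -> str:
    # Stand-in for real rasterization (e.g. via an SVG-to-PNG library);
    # here the "rendering" is just the SVG source itself.
    return svg_code

def policy_model_answer(rendering: str, question: str) -> str:
    # Stand-in for the policy VLM; a crude keyword check for the demo.
    if "circle" in rendering and 'fill="red"' in rendering:
        return "a red circle"
    return "unknown"

def codevqa_score(svg_code: str, qa_pairs) -> float:
    # Fraction of questions answered correctly from the rendering alone;
    # a high score suggests the SVG preserved the image's symbols.
    rendering = render_svg(svg_code)
    correct = sum(
        policy_model_answer(rendering, q).strip().lower() == a.strip().lower()
        for q, a in qa_pairs
    )
    return correct / len(qa_pairs) if qa_pairs else 0.0

svg = '<svg xmlns="http://www.w3.org/2000/svg"><circle r="10" fill="red"/></svg>'
score = codevqa_score(svg, [("What shape is shown?", "a red circle")])
```

In the real protocol the policy model never sees the source image, only the rendered SVG, so a correct answer is evidence that the symbols survived the image-to-code translation.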
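
VCoder's Thinking-with-Revision axis can likewise be sketched as a simple generate-render-critique loop. Every name here is an assumption for illustration (the paper does not publish this interface), and the Acting-with-Visual-Tools axis — detectors and parsers supplying structured cues — is omitted; the stubs below stand in for the VLM and the renderer.

```python
# Hedged sketch of a Thinking-with-Revision loop: generate SVG, render it,
# ask a critic to describe discrepancies against the target image, and
# regenerate with that feedback until no discrepancy remains or the round
# budget runs out. All names are illustrative assumptions.

def revise_svg(target_image, generate, critique, render, max_rounds=3):
    svg = generate(target_image, feedback=None)
    for _ in range(max_rounds):
        rendering = render(svg)
        feedback = critique(target_image, rendering)
        if not feedback:          # no discrepancies reported: accept draft
            break
        svg = generate(target_image, feedback=feedback)
    return svg

# Toy demo with stubs: the critic complains until the draft has a circle.
drafts = iter(["<svg/>", '<svg><circle r="5"/></svg>'])
generate = lambda image, feedback=None: next(drafts)
critique = lambda image, rendering: "" if "circle" in rendering else "missing circle"
render = lambda svg_code: svg_code   # identity stand-in for rasterization
final_svg = revise_svg("target.png", generate, critique, render)
```

The early exit when the critique is empty is what makes this a revision loop rather than fixed-budget resampling: the model stops as soon as the rendering matches its own reading of the target.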