VCode: A Multimodal Coding Benchmark with SVG as Symbolic Visual Representation
November 4, 2025
Authors: Kevin Qinghong Lin, Yuhao Zheng, Hangyu Ran, Dantong Zhu, Dongxing Mao, Linjie Li, Philip Torr, Alex Jinpeng Wang
cs.AI
Abstract
Code has emerged as a precise and executable medium for reasoning and action
in the agent era. Yet, progress has largely focused on language-centric tasks
such as program synthesis and debugging, leaving visual-centric coding
underexplored. Inspired by how humans reason over sketches, we advocate SVG
code as a compact, interpretable, and executable visual representation. We
introduce VCode, a benchmark that reframes multimodal understanding as code
generation: given an image, a model must produce SVG that preserves symbolic
meaning for downstream reasoning. VCode covers three domains - general
commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric
perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel
evaluation protocol in which a policy model answers questions over rendered
SVGs; correct answers indicate faithful symbolic preservation. Empirically,
frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap
between language-centric and visual-centric coding. To close this gap, we
introduce VCoder, an agentic framework that augments VLMs along two axes: (i)
Thinking with Revision, which iteratively analyzes discrepancies and refines
SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply
structured cues such as objects, shapes, and text beyond the model's intrinsic
capacity. Across benchmarks, frontier VLMs with strong reasoning capabilities
score well overall yet remain limited in professional knowledge and 3D
reasoning. VCoder delivers a 12.3-point overall gain over the top-performing
Claude-4-Opus. Human studies show that both humans and VLMs perform worse on
rendered SVGs, yet their consistency reveals the promise of symbolic visual
representation. The benchmark and code are available at
https://github.com/CSU-JPG/VCode.
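
The CodeVQA protocol described above can be sketched as follows. This is an illustrative stand-in, not the benchmark's actual API: `render_svg`, `policy_model_answer`, and `codevqa_score` are hypothetical names, and both rasterization and the policy VLM are stubbed for the demo. The idea is simply that symbolic fidelity is measured as QA accuracy over the rendered SVG.

```python
# Hedged sketch of CodeVQA: render the generated SVG, have a policy model
# answer questions over the rendering, and score fidelity as QA accuracy.
# All function names below are illustrative assumptions.

def render_svg(svg_code: str) -> str:
    # Stand-in for real rasterization (e.g. via an SVG-to-PNG library);
    # here the "rendering" is just the SVG source itself.
    return svg_code

def policy_model_answer(rendering: str, question: str) -> str:
    # Stand-in for the policy VLM; a crude keyword check for the demo.
    if "circle" in rendering and 'fill="red"' in rendering:
        return "a red circle"
    return "unknown"

def codevqa_score(svg_code: str, qa_pairs) -> float:
    # Fraction of questions answered correctly from the rendering alone;
    # a high score suggests the SVG preserved the image's symbols.
    rendering = render_svg(svg_code)
    correct = sum(
        policy_model_answer(rendering, q).strip().lower() == a.strip().lower()
        for q, a in qa_pairs
    )
    return correct / len(qa_pairs) if qa_pairs else 0.0

svg = '<svg xmlns="http://www.w3.org/2000/svg"><circle r="10" fill="red"/></svg>'
score = codevqa_score(svg, [("What shape is shown?", "a red circle")])
```

In the real protocol the policy model never sees the source image, only the rendered SVG, so a correct answer is evidence that the symbols survived the image-to-code translation.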
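
VCoder's Thinking-with-Revision axis can likewise be sketched as a simple generate-render-critique loop. Every name here is an assumption for illustration (the paper does not publish this interface), and the Acting-with-Visual-Tools axis — detectors and parsers supplying structured cues — is omitted; the stubs below stand in for the VLM and the renderer.

```python
# Hedged sketch of a Thinking-with-Revision loop: generate SVG, render it,
# ask a critic to describe discrepancies against the target image, and
# regenerate with that feedback until no discrepancy remains or the round
# budget runs out. All names are illustrative assumptions.

def revise_svg(target_image, generate, critique, render, max_rounds=3):
    svg = generate(target_image, feedback=None)
    for _ in range(max_rounds):
        rendering = render(svg)
        feedback = critique(target_image, rendering)
        if not feedback:          # no discrepancies reported: accept draft
            break
        svg = generate(target_image, feedback=feedback)
    return svg

# Toy demo with stubs: the critic complains until the draft has a circle.
drafts = iter(["<svg/>", '<svg><circle r="5"/></svg>'])
generate = lambda image, feedback=None: next(drafts)
critique = lambda image, rendering: "" if "circle" in rendering else "missing circle"
render = lambda svg_code: svg_code   # identity stand-in for rasterization
final_svg = revise_svg("target.png", generate, critique, render)
```

The early exit when the critique is empty is what makes this a revision loop rather than fixed-budget resampling: the model stops as soon as the rendering matches its own reading of the target.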