VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation
November 4, 2025
Authors: Kevin Qinghong Lin, Yuhao Zheng, Hangyu Ran, Dantong Zhu, Dongxing Mao, Linjie Li, Philip Torr, Alex Jinpeng Wang
cs.AI
Abstract
Code has emerged as a precise and executable medium for reasoning and action
in the agent era. Yet, progress has largely focused on language-centric tasks
such as program synthesis and debugging, leaving visual-centric coding
underexplored. Inspired by how humans reason over sketches, we advocate SVG
code as a compact, interpretable, and executable visual representation. We
introduce VCode, a benchmark that reframes multimodal understanding as code
generation: given an image, a model must produce SVG that preserves symbolic
meaning for downstream reasoning. VCode covers three domains - general
commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric
perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel
evaluation protocol in which a policy model answers questions over rendered
SVGs; correct answers indicate faithful symbolic preservation. Empirically,
frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap
between language-centric and visual-centric coding. To close this gap, we
introduce VCoder, an agentic framework that augments VLMs along two axes: (i)
Thinking with Revision, which iteratively analyzes discrepancies and refines
SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply
structured cues such as objects, shapes, and text beyond the model's intrinsic
capacity. Across benchmarks, frontier VLMs with strong reasoning capabilities
score well overall yet remain limited in professional knowledge and 3D
reasoning. VCoder delivers a 12.3-point overall gain over the top-performing
Claude-4-Opus. Human studies show that both humans and VLMs perform worse on
rendered SVGs, yet their consistency reveals the promise of symbolic visual
representation. The benchmark and code are available at
https://github.com/CSU-JPG/VCode.
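
The core of the CodeVQA protocol is that a policy model answers questions while seeing only the rendered SVG, and its accuracy serves as the symbolic-fidelity score. A minimal sketch of that scoring loop (all function and variable names here are illustrative, not the benchmark's actual API):

```python
# Hypothetical sketch of CodeVQA-style scoring: a policy model is queried
# on rendered SVGs, and the fraction of correct answers is the fidelity score.

def codevqa_score(rendered_svgs, qa_pairs, policy_model):
    """Fraction of questions the policy model answers correctly
    when shown only the rendered SVG (not the source image)."""
    correct = 0
    for rendering, (question, answer) in zip(rendered_svgs, qa_pairs):
        prediction = policy_model(rendering, question)
        correct += int(prediction.strip().lower() == answer.strip().lower())
    return correct / len(qa_pairs)

# Toy stand-in for a policy model: answers from a lookup keyed on the rendering.
toy_model = lambda rendering, question: {"svg_a": "two", "svg_b": "red"}[rendering]

score = codevqa_score(
    ["svg_a", "svg_b"],
    [("How many shapes are there?", "two"), ("What color is the circle?", "blue")],
    toy_model,
)
print(score)  # 0.5: one of the two answers matches
```

In the paper's setup the policy model is a VLM and the renderings are actual raster images of the generated SVG; the sketch only illustrates how question-answer accuracy stands in for pixel-level similarity as the fidelity metric.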