VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

Abstract

Code has emerged as a precise and executable medium for reasoning and action in the agent era. Yet progress has largely focused on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored. Inspired by how humans reason over sketches, we advocate SVG code as a compact, interpretable, and executable visual representation. We introduce VCode, a benchmark that reframes multimodal understanding as code generation: given an image, a model must produce SVG that preserves symbolic meaning for downstream reasoning. VCode covers three domains: general commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel evaluation protocol in which a policy model answers questions over rendered SVGs; correct answers indicate faithful symbolic preservation. Empirically, frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap between language-centric and visual-centric coding. To close this gap, we introduce VCoder, an agentic framework that augments VLMs along two axes: (i) Thinking with Revision, which iteratively analyzes discrepancies and refines SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply structured cues such as objects, shapes, and text beyond the model's intrinsic capacity. Across benchmarks, frontier VLMs with strong reasoning capabilities score well overall yet remain limited in professional knowledge and 3D reasoning. VCoder delivers a 12.3-point overall gain over the top-performing Claude-4-Opus. Human studies show that both humans and VLMs perform worse on rendered SVGs, yet their consistency reveals the promise of symbolic visual representation. The benchmark and code are available at https://github.com/CSU-JPG/VCode.
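The CodeVQA protocol described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Sample` type, the `coder`, `renderer`, and `policy` callables, and the exact-match `judge` are all hypothetical stand-ins, and the actual benchmark may score answers differently. The point it shows is that the policy model sees only the rendered SVG, never the original image, so a correct answer suggests the SVG preserved the relevant symbolic content.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Sample:
    """One benchmark item (hypothetical layout, assumed for illustration)."""
    image: bytes    # original input image
    question: str   # question about the image
    answer: str     # reference answer


# Hypothetical component signatures; none of these names come from the VCode repository.
Coder = Callable[[bytes], str]          # image -> SVG source
Renderer = Callable[[str], bytes]       # SVG source -> rendered raster image
Policy = Callable[[bytes, str], str]    # (rendered image, question) -> predicted answer
Judge = Callable[[str, str], bool]      # (predicted, reference) -> is it correct?


def exact_match(pred: str, ref: str) -> bool:
    """Assumed scoring rule: case-insensitive exact match."""
    return pred.strip().lower() == ref.strip().lower()


def codevqa_accuracy(
    samples: Iterable[Sample],
    coder: Coder,
    renderer: Renderer,
    policy: Policy,
    judge: Judge = exact_match,
) -> float:
    """Fraction of questions answered correctly from rendered SVGs alone."""
    items = list(samples)
    if not items:
        return 0.0
    correct = 0
    for s in items:
        svg = coder(s.image)                  # the model under test writes SVG code
        rendered = renderer(svg)              # the SVG is rendered back to an image
        pred = policy(rendered, s.question)   # the policy never sees s.image
        correct += judge(pred, s.answer)
    return correct / len(items)
```

In practice `coder` would wrap the VLM under evaluation and `renderer` an SVG rasterizer; passing them in as plain callables keeps the sketch independent of any particular model or library.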
