Visual-ERM: Reward Modeling for Visual Equivalence

Abstract

Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement learning remains challenging because reward signals are misaligned with visual quality. Existing rewards rely on either textual rules or coarse visual-embedding similarity; both fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. We propose the Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback by evaluating vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code, yields consistent gains on table and SVG parsing (+2.7, +4.1 on average), and further strengthens test-time scaling via reflection and revision. We also introduce VisualCritic-RewardBench (VC-RewardBench), a benchmark for judging fine-grained image-to-image discrepancies on structured visual data, on which Visual-ERM at 8B decisively outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models. Our results suggest that fine-grained visual reward supervision is both necessary and sufficient for vision-to-code RL, regardless of task specificity.
