Visual-ERM: Reward Modeling for Visual Equivalence
March 13, 2026
Authors: Ziyu Liu, Shengyuan Ding, Xinyu Fang, Xuanlang Dai, Penghui Yang, Jianze Liang, Jiaqi Wang, Kai Chen, Dahua Lin, Yuhang Zang
cs.AI
Abstract
Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement learning remains challenging due to misaligned reward signals. Existing rewards either rely on textual rules or on coarse visual embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. We propose the Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback by evaluating vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code, yields consistent gains on table and SVG parsing (+2.7 and +4.1 on average), and further strengthens test-time scaling via reflection and revision. We also introduce VisualCritic-RewardBench (VC-RewardBench), a benchmark for judging fine-grained image-to-image discrepancies on structured visual data, on which the 8B Visual-ERM decisively outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models. Our results suggest that fine-grained visual reward supervision is both necessary and sufficient for vision-to-code RL, regardless of task specificity.
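The reward loop the abstract describes — render the model's candidate code, then score the rendering against the reference image in visual space rather than by text rules — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the names (`visual_equivalence_reward`, `render`, `judge`, `Judgment`) and the judge/renderer stubs are hypothetical, standing in for a real renderer and a generative reward model such as Visual-ERM.

```python
# Hedged sketch of a visual-equivalence reward for vision-to-code RL.
# All identifiers here are illustrative assumptions, not the paper's API.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Judgment:
    score: float   # fine-grained visual-equivalence score in [0, 1]
    critique: str  # interpretable rationale emitted by a generative judge

# A judge compares (reference image, rendered image) in pixel space.
Judge = Callable[[bytes, bytes], Judgment]

def visual_equivalence_reward(
    reference_image: bytes,
    candidate_code: str,
    render: Callable[[str], Optional[bytes]],
    judge: Judge,
) -> float:
    """Reward for one RL rollout: execute/render the candidate code,
    then score the rendering against the reference image visually.
    Code that fails to render gets zero reward, which also blocks the
    text-rule reward hacking the abstract warns about."""
    rendered = render(candidate_code)
    if rendered is None:  # code did not execute or produced no image
        return 0.0
    return judge(reference_image, rendered).score

# Stub renderer/judge so the sketch runs end to end (illustrative only).
def stub_render(code: str) -> Optional[bytes]:
    # Pretend only code containing "plot" renders successfully.
    return code.encode() if "plot" in code else None

def stub_judge(ref: bytes, out: bytes) -> Judgment:
    # A real judge would be a multimodal generative model.
    return Judgment(1.0 if ref == out else 0.5, "stub critique")

r_ok = visual_equivalence_reward(b"plot(x)", "plot(x)", stub_render, stub_judge)
r_fail = visual_equivalence_reward(b"plot(x)", "print(1)", stub_render, stub_judge)
```

In an actual RL setup, `r_ok`-style scalars would feed a policy-gradient update, and the judge's `critique` string is what enables the reflection-and-revision test-time scaling the abstract mentions.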