Visual-ERM: Reward Modeling for Visual Equivalence
March 13, 2026
Authors: Ziyu Liu, Shengyuan Ding, Xinyu Fang, Xuanlang Dai, Penghui Yang, Jianze Liang, Jiaqi Wang, Kai Chen, Dahua Lin, Yuhang Zang
cs.AI
Abstract
Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision-Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement learning remains challenging due to misaligned reward signals: existing rewards rely either on textual rules or on coarse visual-embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. We propose the Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback by evaluating vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 points on chart-to-code, yields consistent gains on table and SVG parsing (+2.7 and +4.1 points on average, respectively), and further strengthens test-time scaling via reflection and revision. We also introduce VisualCritic-RewardBench (VC-RewardBench), a benchmark for judging fine-grained image-to-image discrepancies on structured visual data, on which the 8B Visual-ERM decisively outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models. Our results suggest that fine-grained visual reward supervision is both necessary and sufficient for vision-to-code RL, regardless of task specificity.
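To make the described reward pipeline concrete, here is a minimal sketch of how a generative reward model like Visual-ERM could be wired into an RL loop: render the candidate code, then score visual equivalence against the reference image rather than comparing text or embeddings. The function names (`render`, `visual_erm_judge`), the score range, and the zero-reward handling of non-executable code are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a rendered-space reward for vision-to-code RL.
# All names and interfaces below are assumptions for illustration only.

from dataclasses import dataclass


@dataclass
class RewardOutput:
    critique: str  # fine-grained, interpretable textual feedback
    score: float   # scalar reward, assumed here to lie in [0, 1]


def render(code: str) -> "Image":
    """Execute/render generated code (e.g., a chart script, HTML table,
    or SVG) into a bitmap. Placeholder: a real system would sandbox this."""
    raise NotImplementedError


def visual_erm_judge(reference: "Image", candidate: "Image") -> RewardOutput:
    """Query the generative reward model with both images and parse its
    critique and score. Placeholder for the actual LVLM call."""
    raise NotImplementedError


def vision_to_code_reward(reference: "Image", generated_code: str) -> float:
    """Reward used by the RL trainer: judge visual equivalence in the
    rendered space instead of via textual rules or embedding similarity."""
    try:
        candidate = render(generated_code)
    except Exception:
        return 0.0  # assumed convention: non-executable code earns no reward
    return visual_erm_judge(reference, candidate).score
```

The same judge could, in principle, drive the test-time reflection-and-revision loop the abstract mentions: feed the critique text back to the policy model and re-render, keeping the highest-scoring candidate.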