Visual-ERM: Reward Modeling for Visual Equivalence

Abstract

Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision-Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement learning (RL) remains challenging due to misaligned reward signals. Existing rewards rely either on textual rules or on coarse visual embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. We propose the Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback to evaluate vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code, yields consistent gains on table and SVG parsing (+2.7, +4.1 on average), and further strengthens test-time scaling via reflection and revision. We also introduce VisualCritic-RewardBench (VC-RewardBench), a benchmark for judging fine-grained image-to-image discrepancies on structured visual data, on which the 8B-parameter Visual-ERM decisively outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models. Our results suggest that fine-grained visual reward supervision is both necessary and sufficient for vision-to-code RL, regardless of task specificity.
