ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning
December 4, 2025
Authors: Shengyuan Ding, Xinyu Fang, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiangyu Zhao, Haodong Duan, Xiaoyi Dong, Jianze Liang, Bin Wang, Conghui He, Dahua Lin, Jiaqi Wang
cs.AI
Abstract
Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, limiting their reliability on complex multimodal reasoning tasks. We present ARM-Thinker, an Agentic multimodal Reward Model that autonomously invokes external tools (e.g., image cropping, document page retrieval) to ground judgments in verifiable evidence, replacing static, non-interactive reward scoring. This enables the model to verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, capabilities that existing reward models lack. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. To evaluate agentic reward modeling, we introduce ARMBench-VL, a suite of three benchmarks that assess fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification). ARM-Thinker achieves a +16.2% average improvement on reward modeling benchmarks and a +9.6% improvement on tool-use tasks, and it outperforms baselines on multimodal math and logical reasoning benchmarks. Our results demonstrate that agentic capabilities significantly enhance both the accuracy and the interpretability of reward models.
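As a rough illustration of the judging loop the abstract describes, and not the paper's actual implementation, the Python sketch below shows how a generative reward model might interleave tool calls with its final verdict. Every name here is a hypothetical assumption for illustration: `agentic_judge`, `model.step`, `crop_image`, `retrieve_page`, and the `Judgment` container are not defined by the paper.

```python
# Minimal sketch of an agentic reward-scoring loop, under assumed interfaces.
# All names below are hypothetical; the abstract does not specify the API.

from dataclasses import dataclass, field


@dataclass
class Judgment:
    verdict: str                 # e.g. "A", "B", or "tie" for a pairwise comparison
    rationale: str               # justification grounded in the gathered evidence
    evidence: list = field(default_factory=list)  # (tool_name, tool_output) pairs


def crop_image(image, box):
    """Hypothetical image-level tool: zoom into a region to check fine details."""
    return image.crop(tuple(box))  # PIL-style crop, assuming a PIL.Image input


def retrieve_page(document, page_index):
    """Hypothetical retrieval tool: fetch one page of a multi-page document."""
    return document[page_index]


TOOLS = {"crop_image": crop_image, "retrieve_page": retrieve_page}


def agentic_judge(model, question, responses, max_tool_calls=4):
    """Let the judge request tools until it commits to a final verdict.

    `model.step` is assumed to return a dict that is either
      {"type": "tool", "tool": name, "args": {...}}        -- a verification request
      {"type": "final", "verdict": ..., "rationale": ...}  -- the terminal judgment
    """
    evidence = []
    for _ in range(max_tool_calls):
        action = model.step(question, responses, evidence)
        if action["type"] == "final":
            return Judgment(action["verdict"], action["rationale"], evidence)
        tool = TOOLS[action["tool"]]
        evidence.append((action["tool"], tool(**action["args"])))
    # Tool budget exhausted: force a judgment from the evidence collected so far.
    action = model.step(question, responses, evidence, force_final=True)
    return Judgment(action["verdict"], action["rationale"], evidence)
```

The design point this sketch captures matches the abstract: judging becomes an interactive loop whose final verdict can cite concrete tool outputs as evidence, rather than a single static score emitted in one pass.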