ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning
December 4, 2025
Authors: Shengyuan Ding, Xinyu Fang, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiangyu Zhao, Haodong Duan, Xiaoyi Dong, Jianze Liang, Bin Wang, Conghui He, Dahua Lin, Jiaqi Wang
cs.AI
Abstract
Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, limiting their reliability on complex multimodal reasoning tasks. We present ARM-Thinker, an Agentic multimodal Reward Model that autonomously invokes external tools (e.g., image cropping, document page retrieval) to ground its judgments in verifiable evidence, replacing static, non-interactive reward scoring. This enables the model to verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, capabilities absent from existing reward models. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. To evaluate agentic reward modeling, we introduce ARMBench-VL, a suite of three benchmarks that assess fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification). ARM-Thinker achieves a +16.2% average improvement on reward modeling benchmarks and +9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks. Our results demonstrate that agentic capabilities significantly enhance both the accuracy and interpretability of reward models.
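To make the abstract's judge-act-observe idea concrete, below is a minimal Python sketch of the control loop that agentic reward scoring implies: the judge may request a tool (e.g., crop an image region or retrieve a document page), fold the returned evidence into its context, and only then commit to a score. All identifiers here (judge_step, Crop, RetrievePage, agentic_judge) and the tool budget are illustrative assumptions, not ARM-Thinker's actual interface.

```python
# Illustrative sketch of an agentic reward-judgment loop (assumed design,
# not ARM-Thinker's real API). Requires Python 3.9+.
from dataclasses import dataclass, field


@dataclass
class Crop:                      # zoom into a region to verify a detail
    box: tuple[int, int, int, int]


@dataclass
class RetrievePage:              # fetch another page of a document
    page: int


@dataclass
class Judgment:                  # terminal action: a scalar reward
    score: float
    rationale: str


@dataclass
class State:
    question: str
    candidate_answer: str
    evidence: list[str] = field(default_factory=list)


def judge_step(state: State):
    """Stand-in for one decoding step of the reward model.

    A real system would prompt the model with the question, the candidate
    answer, and all evidence gathered so far, then parse its output into a
    tool call or a final judgment. Here we just crop once, then judge.
    """
    if not state.evidence:
        return Crop(box=(10, 10, 120, 90))
    return Judgment(score=0.8, rationale="claim matches cropped region")


def run_tool(action, state: State) -> str:
    # Execute the requested tool and return its observation as text.
    if isinstance(action, Crop):
        return f"[crop {action.box}: pixels of the requested region]"
    if isinstance(action, RetrievePage):
        return f"[page {action.page}: retrieved document text]"
    raise ValueError(f"unknown tool: {action!r}")


def agentic_judge(state: State, max_tool_calls: int = 4) -> Judgment:
    # Alternate between tool use and judging until the model commits to a
    # score or the tool budget is exhausted.
    for _ in range(max_tool_calls):
        action = judge_step(state)
        if isinstance(action, Judgment):
            return action
        state.evidence.append(run_tool(action, state))
    return Judgment(score=0.0, rationale="budget exhausted; abstain")


if __name__ == "__main__":
    s = State("What year is on the receipt?", "2021")
    print(agentic_judge(s))
```

The paper's multi-stage reinforcement learning would, on this reading, shape both when judge_step emits a tool call and how accurate the final Judgment is; this sketch only fixes the interaction protocol between the two.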