VG-Refiner: Towards Tool-Refined Referring Grounded Reasoning via Agentic Reinforcement Learning
December 6, 2025
Authors: Yuji Wang, Wenlong Liu, Jingxuan Niu, Haoji Zhang, Yansong Tang
cs.AI
Abstract
Tool-integrated visual reasoning (TiVR) has demonstrated great potential for enhancing multimodal problem-solving. However, existing TiVR paradigms focus mainly on integrating various visual tools through reinforcement learning and neglect to design effective mechanisms for responding to unreliable or erroneous tool outputs. This limitation is particularly pronounced in referring and grounding tasks, where inaccurate predictions from detection tools often mislead TiVR models into hallucinated reasoning. To address this issue, we propose VG-Refiner, the first framework aimed at tool-refined referring grounded reasoning. Technically, we introduce a two-stage think-rethink mechanism that enables the model to explicitly analyze and respond to tool feedback, along with a refinement reward that encourages effective correction of poor tool results. In addition, we propose two new metrics and establish fair evaluation protocols to systematically measure the refinement ability of current models. Using a small amount of task-specific data to enhance the refinement capability of VG-Refiner, we achieve a significant improvement in accuracy and correction ability on referring and reasoning grounding benchmarks while preserving the general capabilities of the pretrained model.
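The abstract does not spell out how the refinement reward is computed. As a rough illustration only, the sketch below shows one way a reward could favor correcting a poor tool prediction over merely accepting a good one. Everything in it is an assumption made for illustration, not the paper's formulation: the function names `iou` and `refinement_reward`, the IoU threshold of 0.5, and the reward values 1.0, 0.5, and 0.0 are all hypothetical.

```python
from typing import Tuple

# Axis-aligned box as (x1, y1, x2, y2); this layout is an assumption.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def refinement_reward(tool_box: Box, final_box: Box, gt_box: Box,
                      threshold: float = 0.5) -> float:
    """Hypothetical reward: highest when the model fixes a bad tool box.

    The threshold and the reward values are illustrative placeholders.
    """
    tool_ok = iou(tool_box, gt_box) >= threshold
    final_ok = iou(final_box, gt_box) >= threshold
    if not final_ok:
        return 0.0   # final answer misses the target
    if not tool_ok:
        return 1.0   # tool was wrong, model corrected it
    return 0.5       # tool was right, model kept a correct box


if __name__ == "__main__":
    gt = (0.0, 0.0, 2.0, 2.0)
    bad_tool = (5.0, 5.0, 6.0, 6.0)
    print(refinement_reward(bad_tool, gt, gt))  # correction case
```

Under this toy scheme, a model that blindly copies the tool output earns nothing when the tool is wrong, which is the behavior a refinement-oriented reward is meant to discourage.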