VG-Refiner: Towards Tool-Refined Referring Grounded Reasoning via Agentic Reinforcement Learning

Abstract

Tool-integrated visual reasoning (TiVR) has demonstrated great potential for enhancing multimodal problem-solving. However, existing TiVR paradigms focus mainly on integrating various visual tools through reinforcement learning, and neglect the design of effective response mechanisms for handling unreliable or erroneous tool outputs. This limitation is particularly pronounced in referring and grounding tasks, where inaccurate predictions from detection tools often mislead TiVR models into generating hallucinated reasoning. To address this issue, we propose VG-Refiner, the first framework aimed at tool-refined referring grounded reasoning. Technically, we introduce a two-stage think-rethink mechanism that enables the model to explicitly analyze and respond to tool feedback, along with a refinement reward that encourages effective correction of poor tool results. In addition, we propose two new metrics and establish fair evaluation protocols to systematically measure the refinement ability of current models. Using a small amount of task-specific data to enhance the refinement capability of VG-Refiner, we achieve a significant improvement in accuracy and correction ability on referring and reasoning grounding benchmarks while preserving the general capabilities of the pretrained model.
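The abstract does not define the refinement reward, so the following is only an illustrative sketch of how such a reward could be shaped, not the authors' formulation. It assumes axis-aligned boxes in `(x1, y1, x2, y2)` form, an IoU threshold of 0.5, and hypothetical reward values (`1.0` for correcting a poor tool box, `0.5` for keeping a good one, `0.0` otherwise); the names `iou` and `refinement_reward` are likewise made up for this example.

```python
from typing import Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def refinement_reward(tool_box: Box, final_box: Box, gt_box: Box,
                      thresh: float = 0.5) -> float:
    """Hypothetical reward: favor correcting a poor tool prediction.

    The threshold and the reward values are assumptions for illustration.
    """
    tool_ok = iou(tool_box, gt_box) >= thresh
    final_ok = iou(final_box, gt_box) >= thresh
    if final_ok and not tool_ok:
        return 1.0  # the model corrected an inaccurate tool box
    if final_ok and tool_ok:
        return 0.5  # the model kept (or mildly adjusted) a good tool box
    return 0.0      # the final answer is still wrong


if __name__ == "__main__":
    gt = (0.0, 0.0, 10.0, 10.0)
    bad_tool = (20.0, 20.0, 30.0, 30.0)
    print(refinement_reward(bad_tool, gt, gt))
```

Giving the correction case a larger value than the keep case is one way to make refinement, rather than blind acceptance of the tool output, the behavior the policy is pushed toward; the actual reward design may differ.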
