TACO: Tool-Augmented Credit Optimization for Agentic Tool Use

Abstract

Agentic multimodal models perform diverse operations on an image via code and reason over the returned view, an effective paradigm for fine-grained visual question answering. However, code operations can be useful, redundant, or misleading. Outcome-only rewards cannot precisely distinguish these cases, and existing process rewards either fail to attribute final correctness to individual tool calls, or require an external judge model. To address this, we introduce Tool-Augmented Credit Optimization (TACO), a GRPO variant for code-tool agents built on two coupled advantage channels. The first, Differential Answer-Probe Reward (DAPR), is a self-supervised, judge-free tool-contribution advantage that credits each tool call by its own effect on answering correctly. Probe tokens inserted into the model's reasoning elicit its predictions with and without the tool, and the difference in outcome reward is taken as the call's value: positive for a useful call, negative for a misleading one, and zero for one that changes nothing. This reuses the existing answer checker with no auxiliary judge, and, being a difference rather than an absolute probe score, is naturally robust to probe-hacking. The second is the outcome advantage from the final answer, distributed by Outcome-Gated Advantage Routing (OGAR): a parameter-free rule that, conditioned on the call's outcome, delivers this credit only to the responsible segments, suppressing wasted tool calls without any cost term. We train TACO through a two-stage SFT+RL pipeline. Extensive experiments across perception, reasoning, and general multimodal benchmarks show that it yields consistent accuracy gains and learns to invoke its tools only when they help.
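
The differential idea behind DAPR can be illustrated with a minimal sketch. It assumes the probed answers with and without the tool call are already available as strings, and that the existing answer checker returns a scalar outcome reward. The names `dapr_value` and `exact_match` are hypothetical and introduced only for illustration; the abstract does not specify the probe format, the checker, or how this value is normalized into a GRPO advantage.

```python
from typing import Callable


def exact_match(prediction: str, ground_truth: str) -> float:
    # Hypothetical stand-in for the existing answer checker:
    # 1.0 for a correct answer, 0.0 otherwise.
    return 1.0 if prediction.strip().lower() == ground_truth.strip().lower() else 0.0


def dapr_value(
    answer_with_tool: str,
    answer_without_tool: str,
    ground_truth: str,
    checker: Callable[[str, str], float],
) -> float:
    # Value of one tool call: outcome reward of the probed answer with the
    # tool result minus outcome reward of the probed answer without it.
    return checker(answer_with_tool, ground_truth) - checker(answer_without_tool, ground_truth)


# Illustrative cases (assumed data, not results from the paper).
useful = dapr_value("7", "3", "7", exact_match)         # tool fixes a wrong answer
misleading = dapr_value("3", "7", "7", exact_match)     # tool breaks a right answer
neutral_right = dapr_value("7", "7", "7", exact_match)  # right either way
neutral_wrong = dapr_value("3", "3", "7", exact_match)  # wrong either way
print(useful, misleading, neutral_right, neutral_wrong)
```

Under these assumptions, a call whose probe answer is correct both with and without the tool scores zero, which illustrates why a difference is harder to exploit than an absolute probe score: answering the probe correctly earns no tool credit unless the tool changed the outcome.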