VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection

Abstract

We introduce VisTA, a new reinforcement learning framework that empowers visual agents to dynamically explore, select, and combine tools from a diverse library based on empirical performance. Existing methods for tool-augmented reasoning rely on either training-free prompting or large-scale fine-tuning; both lack active tool exploration and typically assume limited tool diversity, and fine-tuning methods additionally demand extensive human supervision. In contrast, VisTA leverages end-to-end reinforcement learning to iteratively refine sophisticated, query-specific tool selection strategies, using task outcomes as feedback signals. Through Group Relative Policy Optimization (GRPO), our framework enables an agent to autonomously discover effective tool-selection pathways without requiring explicit reasoning supervision. Experiments on the ChartQA, Geometry3K, and BlindTest benchmarks demonstrate that VisTA achieves substantial performance gains over training-free baselines, especially on out-of-distribution examples. These results highlight VisTA's ability to enhance generalization, adaptively utilize diverse tools, and pave the way for flexible, experience-driven visual reasoning systems.
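
To make the role of task outcomes as feedback more concrete, the following is a minimal sketch of the group-relative advantage computation that GRPO is generally built around: several candidate tool selections are sampled for the same query, each receives an outcome reward, and each reward is normalized against its own group. The function name `group_relative_advantages`, the binary reward values, the use of the population standard deviation, and the `eps` constant are all illustrative assumptions, not details taken from VisTA itself.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one outcome reward per sampled tool selection for a
    single query. `eps` (an assumed constant) avoids division by zero when
    every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical outcome rewards: 1.0 if the downstream answer was correct
# after using the selected tools, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Under this scheme, selections that did better than the rest of their group get a positive advantage and the others a negative one, so no separately learned value function or annotated reasoning trace is needed. When every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.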