VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection
May 26, 2025
Authors: Zeyi Huang, Yuyang Ji, Anirudh Sundara Rajan, Zefan Cai, Wen Xiao, Junjie Hu, Yong Jae Lee
cs.AI
Abstract
We introduce VisTA, a new reinforcement learning framework that empowers
visual agents to dynamically explore, select, and combine tools from a diverse
library based on empirical performance. Existing methods for tool-augmented
reasoning either rely on training-free prompting or large-scale fine-tuning;
both lack active tool exploration and typically assume limited tool diversity,
and fine-tuning methods additionally demand extensive human supervision. In
contrast, VisTA leverages end-to-end reinforcement learning to iteratively
refine sophisticated, query-specific tool selection strategies, using task
outcomes as feedback signals. Through Group Relative Policy Optimization
(GRPO), our framework enables an agent to autonomously discover effective
tool-selection pathways without requiring explicit reasoning supervision.
Experiments on the ChartQA, Geometry3K, and BlindTest benchmarks demonstrate
that VisTA achieves substantial performance gains over training-free baselines,
especially on out-of-distribution examples. These results highlight VisTA's
ability to enhance generalization, adaptively utilize diverse tools, and pave
the way for flexible, experience-driven visual reasoning systems.
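
The abstract does not spell out how task outcomes become a learning signal. As a rough illustration only, the sketch below shows the group-relative advantage normalization commonly associated with GRPO: several tool selections are sampled for the same query, each receives an outcome reward, and each reward is standardized against its own group. The tool names, the binary reward, and the function name `group_relative_advantages` are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the mean and std of its own group.

    Assumption: population standard deviation plus a small epsilon;
    actual GRPO implementations may differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four tool selections sampled for one query.
sampled_tool_sets = [
    ("chart_to_table",),
    ("chart_to_table", "ocr"),
    ("ocr",),
    ("captioner",),
]

# Hypothetical outcome reward: 1.0 if the final answer was correct, else 0.0.
rewards = [1.0, 1.0, 0.0, 0.0]

advantages = group_relative_advantages(rewards)
for tools, adv in zip(sampled_tool_sets, advantages):
    print(tools, round(adv, 3))
```

Selections that beat the group average receive a positive advantage and are reinforced; those below it are discouraged. When every sample in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.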