VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection

May 26, 2025
Authors: Zeyi Huang, Yuyang Ji, Anirudh Sundara Rajan, Zefan Cai, Wen Xiao, Junjie Hu, Yong Jae Lee
cs.AI

Abstract

We introduce VisTA, a new reinforcement learning framework that empowers visual agents to dynamically explore, select, and combine tools from a diverse library based on empirical performance. Existing methods for tool-augmented reasoning rely on either training-free prompting or large-scale fine-tuning; both lack active tool exploration and typically assume limited tool diversity, and fine-tuning methods additionally demand extensive human supervision. In contrast, VisTA leverages end-to-end reinforcement learning to iteratively refine sophisticated, query-specific tool selection strategies, using task outcomes as feedback signals. Through Group Relative Policy Optimization (GRPO), our framework enables an agent to autonomously discover effective tool-selection pathways without requiring explicit reasoning supervision. Experiments on the ChartQA, Geometry3K, and BlindTest benchmarks demonstrate that VisTA achieves substantial performance gains over training-free baselines, especially on out-of-distribution examples. These results highlight VisTA's ability to enhance generalization and adaptively utilize diverse tools, paving the way for flexible, experience-driven visual reasoning systems.
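To make the role of task outcomes as feedback more concrete, the following is a minimal sketch of the group-relative advantage used in GRPO: several candidates are sampled for the same query, each receives an outcome reward, and each reward is normalized against the mean and standard deviation of its own group. This is an illustration under stated assumptions, not the authors' implementation. The tool names, the binary reward, the helper name group_relative_advantages, and the use of the population standard deviation are all hypothetical choices made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    Assumption: population std (pstdev); implementations differ on this detail.
    eps avoids division by zero when every sample in the group scores the same.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four tool selections sampled for one query.
# Assumed reward: 1.0 if the final answer was correct, 0.0 otherwise.
sampled_tool_sets = [
    ("chart_to_table",),
    ("ocr",),
    ("chart_to_table", "ocr"),
    (),
]
outcome_rewards = [1.0, 0.0, 1.0, 0.0]

advantages = group_relative_advantages(outcome_rewards)
for tools, adv in zip(sampled_tool_sets, advantages):
    # Selections that beat the group average get a positive advantage.
    print(tools, round(adv, 3))
```

Selections with above-average outcomes receive positive advantages and are reinforced, while below-average ones are discouraged, so no step-by-step reasoning labels are needed. When every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.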
