VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection

Abstract

We introduce VisTA, a new reinforcement learning framework that enables visual agents to dynamically explore, select, and combine tools from a diverse library based on empirical performance. Existing methods for tool-augmented reasoning rely either on training-free prompting or on large-scale fine-tuning; both lack active tool exploration and typically assume limited tool diversity, and fine-tuning methods additionally demand extensive human supervision. In contrast, VisTA uses end-to-end reinforcement learning to iteratively refine sophisticated, query-specific tool selection strategies, with task outcomes as the feedback signal. Through Group Relative Policy Optimization (GRPO), the framework enables an agent to autonomously discover effective tool-selection pathways without explicit reasoning supervision. Experiments on the ChartQA, Geometry3K, and BlindTest benchmarks show that VisTA achieves substantial performance gains over training-free baselines, especially on out-of-distribution examples. These results highlight VisTA's ability to improve generalization and adaptively use diverse tools, paving the way for flexible, experience-driven visual reasoning systems.
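
The group-relative idea behind GRPO can be illustrated with a minimal sketch. This is not the authors' implementation: it only shows the commonly described normalization step, in which several candidate outputs are sampled for the same query and each one's reward is compared against the mean and standard deviation of its own group. The function name `group_relative_advantages`, the `eps` stabilizer, and the binary rewards are assumptions made for illustration; the abstract does not specify the reward design.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each reward relative to its group: (r - mean) / (std + eps).

    `rewards` holds one scalar per sampled candidate for the same query.
    `eps` is an assumed stabilizer that avoids division by zero when
    every candidate in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled tool selections for one query, each
# scored 1.0 if the downstream answer was correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, selections that led to a correct answer receive a positive advantage and the others a negative one, so no separate learned value estimate is needed to decide which selections to reinforce. When every candidate in a group earns the same reward, all advantages are zero and that group contributes no learning signal.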
