VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection

Abstract

We introduce VisTA, a new reinforcement learning framework that enables visual agents to dynamically explore, select, and combine tools from a diverse library based on empirical performance. Existing methods for tool-augmented reasoning rely on either training-free prompting or large-scale fine-tuning. Both lack active tool exploration and typically assume limited tool diversity, and fine-tuning methods additionally demand extensive human supervision. In contrast, VisTA uses end-to-end reinforcement learning to iteratively refine sophisticated, query-specific tool-selection strategies, with task outcomes as the feedback signal. Through Group Relative Policy Optimization (GRPO), the framework lets an agent autonomously discover effective tool-selection pathways without explicit reasoning supervision. Experiments on the ChartQA, Geometry3K, and BlindTest benchmarks show that VisTA achieves substantial performance gains over training-free baselines, especially on out-of-distribution examples. These results highlight VisTA's ability to enhance generalization and adaptively use diverse tools, paving the way for flexible, experience-driven visual reasoning systems.
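
To make the GRPO step more concrete, the following is a minimal sketch of the group-relative advantage computation that GRPO is generally built on: several candidate tool selections are sampled for the same query, each is scored by the task outcome, and each score is normalized against the group's own mean and standard deviation. The binary reward (1.0 for a correct final answer, 0.0 otherwise), the group size of four, and the function name `group_relative_advantages` are illustrative assumptions, not details taken from the abstract.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar per sampled tool selection for a single
    query. `eps` avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled tool selections for one query.
# Assumed reward: 1.0 if the downstream answer was correct, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every selection got the same outcome carries no signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Selections that beat their group's average receive a positive advantage and are reinforced, while those below it are discouraged. Because the baseline comes from the group itself, no separate value network or step-by-step reasoning annotation is needed in this sketch.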