VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection
May 26, 2025
Authors: Zeyi Huang, Yuyang Ji, Anirudh Sundara Rajan, Zefan Cai, Wen Xiao, Junjie Hu, Yong Jae Lee
cs.AI
Abstract
We introduce VisTA, a new reinforcement learning framework that empowers
visual agents to dynamically explore, select, and combine tools from a diverse
library based on empirical performance. Existing methods for tool-augmented
reasoning either rely on training-free prompting or large-scale fine-tuning;
both lack active tool exploration and typically assume limited tool diversity,
and fine-tuning methods additionally demand extensive human supervision. In
contrast, VisTA leverages end-to-end reinforcement learning to iteratively
refine sophisticated, query-specific tool selection strategies, using task
outcomes as feedback signals. Through Group Relative Policy Optimization
(GRPO), our framework enables an agent to autonomously discover effective
tool-selection pathways without requiring explicit reasoning supervision.
Experiments on the ChartQA, Geometry3K, and BlindTest benchmarks demonstrate
that VisTA achieves substantial performance gains over training-free baselines,
especially on out-of-distribution examples. These results highlight VisTA's
ability to enhance generalization, adaptively utilize diverse tools, and pave
the way for flexible, experience-driven visual reasoning systems.
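
The abstract does not spell out the training loop, so the following is only a minimal sketch of the group-relative idea behind GRPO, applied to a toy tool-selection problem. It assumes a hypothetical setting in which each tool has a fixed success probability and the only reward is a binary task outcome. The names `train_tool_selector` and `success_prob` are illustrative and not taken from the paper, and the clipped probability ratio and KL penalty used in full GRPO are omitted for brevity.

```python
import math
import random
from statistics import mean, pstdev


def softmax(logits):
    """Convert raw tool logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def train_tool_selector(success_prob, steps=500, group_size=8, lr=0.1, seed=0):
    """Toy policy-gradient loop over a softmax tool-selection policy.

    success_prob[i] is a hypothetical probability that tool i leads to a
    correct final answer; it stands in for the real task-outcome reward.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(success_prob)
    for _ in range(steps):
        probs = softmax(logits)
        # Sample a group of tool choices for the same (implicit) query.
        tools = rng.choices(range(len(logits)), weights=probs, k=group_size)
        # Binary outcome reward: 1.0 if the task was solved, else 0.0.
        rewards = [1.0 if rng.random() < success_prob[t] else 0.0 for t in tools]
        advantages = group_relative_advantages(rewards)
        # REINFORCE-style update: d log pi(t) / d logit_j = 1[j == t] - probs[j]
        for t, adv in zip(tools, advantages):
            for j in range(len(logits)):
                grad = (1.0 if j == t else 0.0) - probs[j]
                logits[j] += lr * adv * grad / group_size
    return softmax(logits)


if __name__ == "__main__":
    # Hypothetical tool library in which the second tool is the most reliable.
    print(train_tool_selector([0.2, 0.9, 0.4]))
```

Because advantages are computed relative to the group, no separate value model is needed: a tool choice is reinforced only when it does better than the other samples drawn for the same query. In this toy run the learned distribution should shift toward the tool with the highest success probability.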