VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection

Abstract

We introduce VisTA, a new reinforcement learning framework that empowers visual agents to dynamically explore, select, and combine tools from a diverse library based on empirical performance. Existing methods for tool-augmented reasoning rely either on training-free prompting or on large-scale fine-tuning; both lack active tool exploration and typically assume limited tool diversity, and fine-tuning methods additionally demand extensive human supervision. In contrast, VisTA leverages end-to-end reinforcement learning to iteratively refine sophisticated, query-specific tool selection strategies, using task outcomes as feedback signals. Through Group Relative Policy Optimization (GRPO), our framework enables an agent to autonomously discover effective tool-selection pathways without requiring explicit reasoning supervision. Experiments on the ChartQA, Geometry3K, and BlindTest benchmarks demonstrate that VisTA achieves substantial performance gains over training-free baselines, especially on out-of-distribution examples. These results highlight VisTA's ability to enhance generalization and adaptively utilize diverse tools, paving the way for flexible, experience-driven visual reasoning systems.
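To make the role of task outcomes as feedback signals more concrete, the following is a minimal sketch of the group-relative advantage computation commonly associated with GRPO. It is an illustration under stated assumptions, not the implementation used in VisTA: the function name `group_relative_advantages`, the binary correct/incorrect reward, and the use of the population standard deviation are all hypothetical choices made here for clarity.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize outcome rewards within one group of sampled rollouts.

    Assumption: each entry of `rewards` is the task-outcome reward for one
    tool-selection rollout sampled for the same query. Each advantage is the
    reward minus the group mean, divided by the group standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    # eps avoids division by zero when every rollout gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four rollouts for a single query:
# 1.0 if the final answer was correct with the selected tools, else 0.0.
example_rewards = [1.0, 0.0, 0.0, 1.0]
example_advantages = group_relative_advantages(example_rewards)
```

In this sketch, rollouts that outperform their group receive positive advantages and the rest receive negative ones, so no learned value function or step-level reasoning label is needed. When every rollout in a group earns the same reward, all advantages are zero and that group contributes no learning signal.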