VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining

Abstract

Recent advancements extend Multimodal Large Language Models (MLLMs) beyond standard visual question answering to using external tools for advanced visual tasks. Despite this progress, precisely executing and effectively composing diverse tools for complex tasks remains a persistent bottleneck. Constrained by sparse tool-sets and simple tool-use trajectories, existing benchmarks fail to capture complex and diverse tool interactions, and fall short in evaluating model performance under practical, real-world conditions. To bridge this gap, we introduce VisualToolChain-Bench (VTC-Bench), a comprehensive benchmark designed to evaluate tool-use proficiency in MLLMs. To align with realistic computer vision pipelines, our framework features 32 diverse OpenCV-based visual operations. This rich tool-set enables extensive combinations, allowing VTC-Bench to rigorously assess multi-tool composition and long-horizon, multi-step plan execution. For precise evaluation, we provide 680 curated problems structured across a nine-category cognitive hierarchy, each with a ground-truth execution trajectory. Extensive experiments on 19 leading MLLMs reveal critical limitations in current models' visual agentic capabilities. Specifically, models struggle to adapt to diverse tool-sets and to generalize to unseen operations: the leading model, Gemini-3.0-Pro, achieves only 51% on our benchmark. Furthermore, multi-tool composition remains a persistent challenge. When facing complex tasks, models struggle to formulate efficient execution plans, relying heavily on a narrow, suboptimal subset of familiar functions rather than selecting the optimal tools. By identifying these fundamental challenges, VTC-Bench establishes a rigorous baseline to guide the development of more generalized visual agentic models.
