

VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining

March 16, 2026
Authors: Xuanyu Zhu, Yuhao Dong, Rundong Wang, Yang Shi, Zhipeng Wu, Yinlun Peng, YiFan Zhang, Yihang Lou, Yuanxing Zhang, Ziwei Liu, Yan Bai, Yuan Zhou
cs.AI

Abstract

Recent advancements extend Multimodal Large Language Models (MLLMs) beyond standard visual question answering to utilizing external tools for advanced visual tasks. Despite this progress, precisely executing and effectively composing diverse tools for complex tasks remains a persistent bottleneck. Constrained by sparse tool-sets and simple tool-use trajectories, existing benchmarks fail to capture complex and diverse tool interactions, falling short in evaluating model performance under practical, real-world conditions. To bridge this gap, we introduce VisualToolChain-Bench (VTC-Bench), a comprehensive benchmark designed to evaluate tool-use proficiency in MLLMs. To align with realistic computer vision pipelines, our framework features 32 diverse OpenCV-based visual operations. This rich tool-set enables extensive combinations, allowing VTC-Bench to rigorously assess multi-tool composition and long-horizon, multi-step plan execution. For precise evaluation, we provide 680 curated problems structured across a nine-category cognitive hierarchy, each with a ground-truth execution trajectory. Extensive experiments on 19 leading MLLMs reveal critical limitations in current models' visual agentic capabilities. Specifically, models struggle to adapt to diverse tool-sets and to generalize to unseen operations: the leading model, Gemini-3.0-Pro, achieves only 51% accuracy on our benchmark. Furthermore, multi-tool composition remains a persistent challenge. When facing complex tasks, models struggle to formulate efficient execution plans, relying heavily on a narrow, suboptimal subset of familiar functions rather than selecting the optimal tools. By identifying these fundamental challenges, VTC-Bench establishes a rigorous baseline to guide the development of more generalized visual agentic models.
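The compositional tool chaining that the abstract describes can be illustrated with a minimal sketch: a registry of named image operations and an executor that applies a tool-use trajectory step by step. This is an assumption-laden toy, not VTC-Bench's actual API. The tool names, the `(tool_name, kwargs)` trajectory format, and the plain-list image representation are all hypothetical; the benchmark's real tools are OpenCV-based, while this example stays dependency-free for clarity.

```python
# Toy "visual tool chain": each tool is a pure function on a 2-D list of
# grayscale pixel values (0-255). VTC-Bench's real operations are OpenCV-based;
# these stand-ins only demonstrate the chaining pattern.

def crop(img, top, left, h, w):
    """Return the h x w sub-image whose top-left corner is (top, left)."""
    return [row[left:left + w] for row in img[top:top + h]]

def threshold(img, t):
    """Binarize: pixels strictly above t become 255, all others 0."""
    return [[255 if p > t else 0 for p in row] for row in img]

def flip_horizontal(img):
    """Mirror each row, flipping the image left-to-right."""
    return [row[::-1] for row in img]

# Tool registry: maps a tool name to its implementation.
TOOLS = {"crop": crop, "threshold": threshold, "flip_horizontal": flip_horizontal}

def run_chain(img, trajectory):
    """Execute a trajectory: an ordered list of (tool_name, kwargs) steps,
    feeding each tool's output into the next tool's input."""
    for name, kwargs in trajectory:
        img = TOOLS[name](img, **kwargs)
    return img

image = [[10, 200, 30],
         [40, 250, 60],
         [70, 80, 90]]

# A three-step trajectory composing all three tools.
result = run_chain(image, [
    ("crop", {"top": 0, "left": 0, "h": 2, "w": 3}),
    ("threshold", {"t": 100}),
    ("flip_horizontal", {}),
])
# result == [[0, 255, 0], [0, 255, 0]]
```

Evaluating against a ground-truth trajectory, as the benchmark does, then amounts to comparing a model's emitted step sequence (and its final output image) against a reference chain like the one above.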