
VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining

March 16, 2026
Authors: Xuanyu Zhu, Yuhao Dong, Rundong Wang, Yang Shi, Zhipeng Wu, Yinlun Peng, YiFan Zhang, Yihang Lou, Yuanxing Zhang, Ziwei Liu, Yan Bai, Yuan Zhou
cs.AI

Abstract

Recent advancements extend Multimodal Large Language Models (MLLMs) beyond standard visual question answering to the use of external tools for advanced visual tasks. Despite this progress, precisely executing and effectively composing diverse tools for complex tasks remains a persistent bottleneck. Constrained by sparse tool sets and simple tool-use trajectories, existing benchmarks fail to capture complex and diverse tool interactions and fall short in evaluating model performance under practical, real-world conditions. To bridge this gap, we introduce VisualToolChain-Bench (VTC-Bench), a comprehensive benchmark designed to evaluate tool-use proficiency in MLLMs. To align with realistic computer vision pipelines, our framework features 32 diverse OpenCV-based visual operations. This rich tool set enables extensive combinations, allowing VTC-Bench to rigorously assess multi-tool composition and long-horizon, multi-step plan execution. For precise evaluation, we provide 680 curated problems structured across a nine-category cognitive hierarchy, each with a ground-truth execution trajectory. Extensive experiments on 19 leading MLLMs reveal critical limitations in current models' visual agentic capabilities. Specifically, models struggle to adapt to diverse tool sets and to generalize to unseen operations; the leading model, Gemini-3.0-Pro, achieves only 51% accuracy on our benchmark. Furthermore, multi-tool composition remains a persistent challenge: when facing complex tasks, models struggle to formulate efficient execution plans, relying heavily on a narrow, suboptimal subset of familiar functions rather than selecting the optimal tools. By identifying these fundamental challenges, VTC-Bench establishes a rigorous baseline to guide the development of more generalized visual agentic models.
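To make the "visual tool chaining" setup concrete, the following is a minimal sketch of executing a tool-use trajectory: a registry of named image operations and a list of (tool, arguments) steps applied in sequence. The tool names, the `run_trajectory` helper, and the NumPy stand-in operations here are illustrative assumptions, not the benchmark's actual API; VTC-Bench itself uses 32 OpenCV-based operations.

```python
import numpy as np

# Hypothetical tool registry: each tool is a pure function image -> image.
# These NumPy stand-ins (crop, flip, threshold) are placeholders for the
# benchmark's OpenCV-based operations.
TOOLS = {
    "crop": lambda img, y0, y1, x0, x1: img[y0:y1, x0:x1],
    "flip_horizontal": lambda img: img[:, ::-1],
    "threshold": lambda img, t: (img > t).astype(img.dtype) * 255,
}

def run_trajectory(image, trajectory):
    """Apply a sequence of (tool_name, kwargs) steps to an image."""
    for name, kwargs in trajectory:
        if name not in TOOLS:
            raise KeyError(f"unknown tool: {name}")
        image = TOOLS[name](image, **kwargs)
    return image

# Example: an 8x8 horizontal-gradient image, cropped and then thresholded.
img = np.tile(np.arange(8, dtype=np.uint8) * 32, (8, 1))
out = run_trajectory(img, [
    ("crop", {"y0": 0, "y1": 4, "x0": 2, "x1": 8}),
    ("threshold", {"t": 128}),
])
print(out.shape)  # (4, 6)
```

A ground-truth execution trajectory in this framing is simply one such step list; grading a model then amounts to comparing the image produced by its chosen steps against the image produced by the reference steps.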