PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs
October 10, 2025
Authors: Zixin Zhang, Kanghao Chen, Xingwang Lin, Lutao Jiang, Xu Zheng, Yuanhuiyi Lyu, Litao Guo, Yinchuan Li, Ying-Cong Chen
cs.AI
Abstract
The ability to use, understand, and create tools is a hallmark of human
intelligence, enabling sophisticated interaction with the physical world. For
any general-purpose intelligent agent to achieve true versatility, it must also
master these fundamental skills. While modern Multimodal Large Language Models
(MLLMs) leverage their extensive common knowledge for high-level planning in
embodied AI and in downstream Vision-Language-Action (VLA) models, the extent
of their true understanding of physical tools remains unquantified. To bridge
this gap, we present PhysToolBench, the first benchmark dedicated to evaluating
the comprehension of physical tools by MLLMs. Our benchmark is structured as a
Visual Question Answering (VQA) dataset comprising over 1,000 image-text pairs.
It assesses capabilities across three distinct difficulty levels: (1) Tool
Recognition: Requiring the recognition of a tool's primary function. (2) Tool
Understanding: Testing the ability to grasp the underlying principles of a
tool's operation. (3) Tool Creation: Challenging the model to fashion a new
tool from surrounding objects when conventional options are unavailable. Our
comprehensive evaluation of 32 MLLMs, spanning proprietary models, open-source
models, specialized embodied models, and the backbones used in VLAs, reveals a significant deficiency in
tool understanding. Furthermore, we provide an in-depth analysis and propose
preliminary solutions. Code and dataset are publicly available.
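Since the benchmark is a VQA dataset with three difficulty levels, per-level accuracy is the natural headline metric. The snippet below is a minimal, hypothetical evaluation sketch, not the released code: the record fields (`image`, `question`, `options`, `answer`, `level`), the file name, and the `query_mllm` stub are all assumptions made for illustration.

```python
import json
from collections import defaultdict

def query_mllm(image_path: str, question: str, options: list[str]) -> str:
    """Placeholder for an actual MLLM call (API or local model).
    Expected to return the letter of the chosen option, e.g. 'A'."""
    return "A"  # stub answer for illustration only

def evaluate(dataset_path: str) -> dict[str, float]:
    """Compute accuracy per difficulty level (Tool Recognition / Understanding / Creation).

    Assumes (hypothetically) that each JSON record looks like:
      {"image": "...", "question": "...", "options": ["...", ...],
       "answer": "B", "level": "Tool Understanding"}
    """
    with open(dataset_path) as f:
        records = json.load(f)

    correct, total = defaultdict(int), defaultdict(int)
    for rec in records:
        pred = query_mllm(rec["image"], rec["question"], rec["options"])
        total[rec["level"]] += 1
        if pred.strip().upper() == rec["answer"].strip().upper():
            correct[rec["level"]] += 1

    return {level: correct[level] / total[level] for level in total}

if __name__ == "__main__":
    # "phystoolbench.json" is a hypothetical file name for the released annotations.
    print(evaluate("phystoolbench.json"))
```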