PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs
October 10, 2025
Authors: Zixin Zhang, Kanghao Chen, Xingwang Lin, Lutao Jiang, Xu Zheng, Yuanhuiyi Lyu, Litao Guo, Yinchuan Li, Ying-Cong Chen
cs.AI
Abstract
The ability to use, understand, and create tools is a hallmark of human
intelligence, enabling sophisticated interaction with the physical world. For
any general-purpose intelligent agent to achieve true versatility, it must also
master these fundamental skills. While modern Multimodal Large Language Models
(MLLMs) leverage their extensive common knowledge for high-level planning in
embodied AI and in downstream Vision-Language-Action (VLA) models, the extent
of their true understanding of physical tools remains unquantified. To bridge
this gap, we present PhysToolBench, the first benchmark dedicated to evaluating
the comprehension of physical tools by MLLMs. Our benchmark is structured as a
Visual Question Answering (VQA) dataset comprising over 1,000 image-text pairs.
It assesses capabilities across three distinct difficulty levels: (1) Tool
Recognition: Requiring the recognition of a tool's primary function. (2) Tool
Understanding: Testing the ability to grasp the underlying principles of a
tool's operation. (3) Tool Creation: Challenging the model to fashion a new
tool from surrounding objects when conventional options are unavailable. Our
comprehensive evaluation of 32 MLLMs, spanning proprietary, open-source, and
specialized embodied models as well as VLA backbones, reveals a significant deficiency in
tool understanding. Furthermore, we provide an in-depth analysis and propose
preliminary solutions. Code and dataset are publicly available.
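
Based only on the abstract's description (a VQA-style benchmark of over 1,000 image-text pairs across three difficulty levels), the sketch below illustrates what a PhysToolBench-style item schema and per-level scoring might look like. The field names, multiple-choice format, and the `accuracy_by_level` helper are illustrative assumptions, not the released dataset format.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical schema for a single PhysToolBench-style VQA item; the actual
# released format is not specified in the abstract, so these fields are
# assumptions for illustration only.
@dataclass
class ToolVQAItem:
    image_path: str      # photo of the tool(s) or scene
    question: str        # e.g. "Which object could open this bottle?"
    options: List[str]   # multiple-choice candidates
    answer_index: int    # index of the correct option
    level: str           # "recognition", "understanding", or "creation"

def accuracy_by_level(items: List[ToolVQAItem], predictions: List[int]) -> Dict[str, float]:
    """Compute per-level accuracy for a model's multiple-choice predictions."""
    totals: Dict[str, int] = {}
    correct: Dict[str, int] = {}
    for item, pred in zip(items, predictions):
        totals[item.level] = totals.get(item.level, 0) + 1
        correct[item.level] = correct.get(item.level, 0) + int(pred == item.answer_index)
    return {lvl: correct[lvl] / totals[lvl] for lvl in totals}

if __name__ == "__main__":
    demo = [
        ToolVQAItem("img_001.jpg", "Which tool drives a screw?",
                    ["hammer", "screwdriver"], 1, "recognition"),
        ToolVQAItem("img_002.jpg", "Why does a claw hammer pull nails easily?",
                    ["leverage", "magnetism"], 0, "understanding"),
    ]
    print(accuracy_by_level(demo, predictions=[1, 0]))
    # {'recognition': 1.0, 'understanding': 1.0}
```

Reporting accuracy per difficulty level, rather than a single aggregate score, mirrors the benchmark's three-tier design and makes it easier to see where a model's tool understanding breaks down.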