EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
February 13, 2025
Authors: Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, Heng Ji, Huan Zhang, Tong Zhang
cs.AI
Abstract
Leveraging Multi-modal Large Language Models (MLLMs) to create embodied
agents offers a promising avenue for tackling real-world tasks. While
language-centric embodied agents have garnered substantial attention,
MLLM-based embodied agents remain underexplored due to the lack of
comprehensive evaluation frameworks. To bridge this gap, we introduce
EmbodiedBench, an extensive benchmark designed to evaluate vision-driven
embodied agents. EmbodiedBench features: (1) a diverse set of 1,128 testing
tasks across four environments, ranging from high-level semantic tasks (e.g.,
household) to low-level tasks involving atomic actions (e.g., navigation and
manipulation); and (2) six meticulously curated subsets evaluating essential
agent capabilities like commonsense reasoning, complex instruction
understanding, spatial awareness, visual perception, and long-term planning.
Through extensive experiments, we evaluated 13 leading proprietary and
open-source MLLMs within EmbodiedBench. Our findings reveal that MLLMs excel
at high-level tasks but struggle with low-level manipulation, with the best
model, GPT-4o, scoring only 28.9% on average. EmbodiedBench provides a
multifaceted standardized evaluation platform that not only highlights existing
challenges but also offers valuable insights to advance MLLM-based embodied
agents. Our code is available at https://embodiedbench.github.io.