EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents

January 21, 2025
Authors: Zhili Cheng, Yuge Tu, Ran Li, Shiqi Dai, Jinyi Hu, Shengding Hu, Jiahao Li, Yang Shi, Tianyu Yu, Weize Chen, Lei Shi, Maosong Sun
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) have advanced significantly, opening a promising future for embodied agents. Existing benchmarks for evaluating MLLMs primarily use static images or videos, which limits assessment to non-interactive scenarios. Meanwhile, existing embodied AI benchmarks are task-specific and insufficiently diverse, so they do not adequately evaluate the embodied capabilities of MLLMs. To address this, we propose EmbodiedEval, a comprehensive and interactive evaluation benchmark for MLLMs on embodied tasks. EmbodiedEval features 328 distinct tasks within 125 varied 3D scenes, each of which is rigorously selected and annotated. It covers a broad spectrum of existing embodied AI tasks with significantly enhanced diversity, all within a unified simulation and evaluation framework tailored for MLLMs. The tasks are organized into five categories that assess different capabilities of the agents: navigation, object interaction, social interaction, attribute question answering, and spatial question answering. We evaluated state-of-the-art MLLMs on EmbodiedEval and found that they fall significantly short of human-level performance on embodied tasks. Our analysis demonstrates the limitations of existing MLLMs in embodied capabilities and provides insights for their future development. We open-source all evaluation data and the simulation framework at https://github.com/thunlp/EmbodiedEval.
