EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents

Abstract

Leveraging Multi-modal Large Language Models (MLLMs) to create embodied agents offers a promising avenue for tackling real-world tasks. While language-centric embodied agents have garnered substantial attention, MLLM-based embodied agents remain underexplored due to the lack of comprehensive evaluation frameworks. To bridge this gap, we introduce EmbodiedBench, an extensive benchmark designed to evaluate vision-driven embodied agents. EmbodiedBench features: (1) a diverse set of 1,128 testing tasks across four environments, ranging from high-level semantic tasks (e.g., household) to low-level tasks involving atomic actions (e.g., navigation and manipulation); and (2) six meticulously curated subsets evaluating essential agent capabilities such as commonsense reasoning, complex instruction understanding, spatial awareness, visual perception, and long-term planning. Through extensive experiments, we evaluated 13 leading proprietary and open-source MLLMs within EmbodiedBench. Our findings reveal that MLLMs excel at high-level tasks but struggle with low-level manipulation, with the best model, GPT-4o, scoring only 28.9% on average. EmbodiedBench provides a multifaceted, standardized evaluation platform that not only highlights existing challenges but also offers valuable insights to advance MLLM-based embodied agents. Our code is available at https://embodiedbench.github.io.
