OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks

Abstract

Large language models excel at abstract reasoning, but their capacity for embodied agent reasoning remains largely unexplored. We present OmniEAR, a comprehensive framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. Unlike existing benchmarks, which provide predefined tool sets or explicit collaboration directives, OmniEAR requires agents to acquire capabilities dynamically and to determine coordination strategies autonomously based on task demands. Using a text-based environment representation, we model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints: models achieve 85-96% success with explicit instructions, but performance drops to 56-85% for tool reasoning and 63-85% for implicit collaboration, and compound tasks show failure rates above 50%. Surprisingly, providing complete environmental information degrades coordination performance, indicating that models cannot filter task-relevant constraints. Fine-tuning improves single-agent tasks dramatically (0.6% to 76.3%) but yields minimal multi-agent gains (1.5% to 5.5%), exposing fundamental architectural limitations. These findings demonstrate that embodied reasoning poses challenges fundamentally different from those current models can address, and they establish OmniEAR as a rigorous benchmark for evaluating and advancing embodied AI systems. Our code and data are included in the supplementary materials and will be open-sourced upon acceptance.
