OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks

Abstract

Large language models excel at abstract reasoning, but their capacity for embodied agent reasoning remains largely unexplored. We present OmniEAR, a comprehensive framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. Unlike existing benchmarks that provide predefined tool sets or explicit collaboration directives, OmniEAR requires agents to dynamically acquire capabilities and autonomously determine coordination strategies based on task demands. Using a text-based environment representation, we model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints: success rates reach 85-96% with explicit instructions but drop to 56-85% for tool reasoning and 63-85% for implicit collaboration, and compound tasks fail more than 50% of the time. Surprisingly, providing complete environmental information degrades coordination performance, indicating that models cannot filter task-relevant constraints. Fine-tuning improves single-agent task performance dramatically (0.6% to 76.3%) but yields minimal multi-agent gains (1.5% to 5.5%), exposing fundamental architectural limitations. These findings demonstrate that embodied reasoning poses challenges fundamentally different from those current models can address, establishing OmniEAR as a rigorous benchmark for evaluating and advancing embodied AI systems. Our code and data are included in the supplementary materials and will be open-sourced upon acceptance.