OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks

Abstract

Large language models excel at abstract reasoning, but their capacity for embodied agent reasoning remains largely unexplored. We present OmniEAR, a comprehensive framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. Unlike existing benchmarks that provide predefined tool sets or explicit collaboration directives, OmniEAR requires agents to dynamically acquire capabilities and autonomously determine coordination strategies based on task demands. Using a text-based environment representation, we model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints: while they achieve 85-96% success with explicit instructions, performance drops to 56-85% for tool reasoning and 63-85% for implicit collaboration, and compound tasks show failure rates above 50%. Surprisingly, providing complete environmental information degrades coordination performance, indicating that models cannot filter for task-relevant constraints. Fine-tuning improves single-agent tasks dramatically (0.6% to 76.3%) but yields minimal multi-agent gains (1.5% to 5.5%), exposing fundamental architectural limitations. These findings demonstrate that embodied reasoning poses challenges fundamentally different from those current models can address, establishing OmniEAR as a rigorous benchmark for evaluating and advancing embodied AI systems. Our code and data are included in the supplementary materials and will be open-sourced upon acceptance.
