EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an Egocentric World?

Abstract

The emergence of multimodal large language models (MLLMs) has driven breakthroughs in egocentric vision applications. These applications require persistent, context-aware understanding of objects as users interact with tools in dynamic and cluttered environments. However, existing embodied benchmarks focus primarily on static scene exploration, emphasizing objects' appearance and spatial attributes while neglecting the assessment of dynamic changes arising from users' interactions. To address this gap, we introduce EOC-Bench, an innovative benchmark designed to systematically evaluate object-centric embodied cognition in dynamic egocentric scenarios. Specifically, EOC-Bench features 3,277 meticulously annotated QA pairs grouped into three temporal categories (Past, Present, and Future), covering 11 fine-grained evaluation dimensions and 3 visual object referencing types. To ensure thorough assessment, we develop a mixed-format human-in-the-loop annotation framework with four types of questions and design a novel multi-scale temporal accuracy metric for open-ended temporal evaluation. Based on EOC-Bench, we conduct comprehensive evaluations of various proprietary, open-source, and object-level MLLMs. EOC-Bench serves as a crucial tool for advancing the embodied object cognition capabilities of MLLMs, establishing a robust foundation for developing reliable core models for embodied systems.
