EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an Egocentric World?
June 5, 2025
作者: Yuqian Yuan, Ronghao Dang, Long Li, Wentong Li, Dian Jiao, Xin Li, Deli Zhao, Fan Wang, Wenqiao Zhang, Jun Xiao, Yueting Zhuang
cs.AI
Abstract
The emergence of multimodal large language models (MLLMs) has driven
breakthroughs in egocentric vision applications. These applications necessitate
persistent, context-aware understanding of objects, as users interact with
tools in dynamic and cluttered environments. However, existing embodied
benchmarks primarily focus on static scene exploration, emphasizing objects'
appearance and spatial attributes while neglecting the assessment of dynamic
changes arising from users' interactions. To address this gap, we introduce
EOC-Bench, an innovative benchmark designed to systematically evaluate
object-centric embodied cognition in dynamic egocentric scenarios. Specifically,
EOC-Bench features 3,277 meticulously annotated QA pairs categorized into three
temporal categories: Past, Present, and Future, covering 11 fine-grained
evaluation dimensions and 3 visual object referencing types. To ensure thorough
assessment, we develop a mixed-format human-in-the-loop annotation framework
with four types of questions and design a novel multi-scale temporal accuracy
metric for open-ended temporal evaluation. Based on EOC-Bench, we conduct
comprehensive evaluations of various proprietary, open-source, and object-level
MLLMs. EOC-Bench serves as a crucial tool for advancing the embodied object
cognitive capabilities of MLLMs, establishing a robust foundation for
developing reliable core models for embodied systems.
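The abstract does not define the multi-scale temporal accuracy metric. As a purely illustrative sketch, one plausible reading is that an open-ended time prediction is scored against the ground truth at several tolerance scales, and the per-scale accuracies are averaged; the function name, the scale values, and the scoring rule below are all assumptions, not the paper's actual definition.

```python
def multi_scale_temporal_accuracy(pred_secs, gold_secs, scales=(1.0, 5.0, 15.0)):
    """Average accuracy over tolerance windows of increasing size.

    A prediction counts as correct at scale `tol` if it falls within
    `tol` seconds of the ground-truth timestamp. Scales are illustrative.
    """
    assert len(pred_secs) == len(gold_secs) and pred_secs
    per_scale = []
    for tol in scales:
        hits = sum(abs(p - g) <= tol for p, g in zip(pred_secs, gold_secs))
        per_scale.append(hits / len(pred_secs))
    return sum(per_scale) / len(per_scale)

# One prediction within all windows, one outside all of them:
print(multi_scale_temporal_accuracy([3.0, 40.0], [2.0, 10.0]))  # → 0.5
```

Averaging over multiple tolerances rewards predictions that are approximately right instead of applying a single hard cutoff, which is the general motivation for multi-scale scoring of open-ended temporal answers.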