EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an Egocentric World?

Abstract

The emergence of multimodal large language models (MLLMs) has driven breakthroughs in egocentric vision applications. These applications require a persistent, context-aware understanding of objects as users interact with tools in dynamic and cluttered environments. However, existing embodied benchmarks focus primarily on static scene exploration, emphasizing objects' appearance and spatial attributes while neglecting the dynamic changes that arise from users' interactions. To address this gap, we introduce EOC-Bench, an innovative benchmark designed to systematically evaluate object-centric embodied cognition in dynamic egocentric scenarios. Specifically, EOC-Bench features 3,277 meticulously annotated QA pairs grouped into three temporal categories (Past, Present, and Future), covering 11 fine-grained evaluation dimensions and 3 visual object referencing types. To ensure thorough assessment, we develop a mixed-format human-in-the-loop annotation framework with four types of questions and design a novel multi-scale temporal accuracy metric for open-ended temporal evaluation. Based on EOC-Bench, we conduct comprehensive evaluations of various proprietary, open-source, and object-level MLLMs. EOC-Bench serves as a crucial tool for advancing the embodied object cognition capabilities of MLLMs and establishes a robust foundation for developing reliable core models for embodied systems.
