

VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI

October 15, 2024
作者: Sijie Cheng, Kechen Fang, Yangyang Yu, Sicheng Zhou, Bohao Li, Ye Tian, Tingguang Li, Lei Han, Yang Liu
cs.AI

Abstract

Recent advancements in Multi-modal Large Language Models (MLLMs) have opened new avenues for applications in Embodied AI. Building on previous work, EgoThink, we introduce VidEgoThink, a comprehensive benchmark for evaluating egocentric video understanding capabilities. To bridge the gap between MLLMs and low-level control in Embodied AI, we design four key interrelated tasks: video question-answering, hierarchy planning, visual grounding, and reward modeling. To minimize manual annotation costs, we develop an automatic data generation pipeline based on the Ego4D dataset, leveraging the prior knowledge and multimodal capabilities of GPT-4o. Three human annotators then filter the generated data to ensure diversity and quality, resulting in the VidEgoThink benchmark. We conduct extensive experiments with three types of models: API-based MLLMs, open-source image-based MLLMs, and open-source video-based MLLMs. Experimental results indicate that all MLLMs, including GPT-4o, perform poorly across all tasks related to egocentric video understanding. These findings suggest that foundation models still require significant advancements to be effectively applied to first-person scenarios in Embodied AI. In conclusion, VidEgoThink reflects a research trend towards employing MLLMs for egocentric vision, akin to human capabilities, enabling active observation and interaction in complex real-world environments.
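
As a rough illustration of the kind of automatic data generation the abstract describes, the sketch below uses GPT-4o to turn a single Ego4D-style narration into a question-answer pair for egocentric video QA. The function name, prompt wording, and use of the OpenAI Python client are assumptions made for illustration; the paper states only that the pipeline leverages GPT-4o's prior knowledge and multimodal capabilities, with human annotators filtering the outputs afterward.

```python
# Hypothetical sketch of one automatic QA-generation step: GPT-4o drafts a
# question-answer pair from an Ego4D-style first-person narration.
# Names and prompt are illustrative, not the authors' actual implementation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def generate_qa_pair(narration: str) -> str:
    """Ask GPT-4o to draft one QA pair from a first-person narration."""
    prompt = (
        "You are given a first-person narration of an egocentric video clip.\n"
        f"Narration: {narration}\n"
        "Write one question about the camera wearer's action and its answer, "
        "formatted as 'Q: ...\\nA: ...'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content


# Example: a single narration in the style of Ego4D annotations
print(generate_qa_pair("#C C picks up the kettle and pours water into the cup."))
```

In the benchmark itself, samples produced this way would still pass through the three-annotator filtering step described above before being included.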

