
MM-Ego: Towards Building Egocentric Multimodal LLMs

October 9, 2024
作者: Hanrong Ye, Haotian Zhang, Erik Daxberger, Lin Chen, Zongyu Lin, Yanghao Li, Bowen Zhang, Haoxuan You, Dan Xu, Zhe Gan, Jiasen Lu, Yinfei Yang
cs.AI

Abstract

This research aims to comprehensively explore building a multimodal foundation model for egocentric video understanding. To achieve this goal, we work on three fronts. First, as there is a lack of QA data for egocentric video understanding, we develop a data engine that efficiently generates 7M high-quality QA samples for egocentric videos ranging from 30 seconds to one hour long, based on human-annotated data. This is currently the largest egocentric QA dataset. Second, we contribute a challenging egocentric QA benchmark with 629 videos and 7,026 questions to evaluate the models' ability to recognize and memorize visual details across videos of varying lengths. We introduce a new de-biasing evaluation method to help mitigate the unavoidable language bias present in the models being evaluated. Third, we propose a specialized multimodal architecture featuring a novel "Memory Pointer Prompting" mechanism. This design includes a global glimpse step to gain an overarching understanding of the entire video and identify key visual information, followed by a fallback step that uses the key visual information to generate responses. This enables the model to comprehend extended video content more effectively. With the data, benchmark, and model, we successfully build MM-Ego, an egocentric multimodal LLM that shows powerful performance on egocentric video understanding.
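To make the two-step "Memory Pointer Prompting" idea concrete, here is a minimal sketch of how a global glimpse pass could score compressed frame features against the question and hand only the selected key frames to the answer-generation step, as the abstract describes. The module and parameter names (GlobalGlimpse, top_k, the elementwise fusion) are illustrative assumptions based solely on the abstract, not the paper's actual architecture or API.

```python
# Conceptual sketch of the two-step mechanism described in the abstract:
# 1) a global glimpse pass scores every frame feature to locate key visual
#    information, 2) a fallback pass keeps only those key frames, which would
#    then be passed to the LLM to generate the response.
import torch
import torch.nn as nn


class GlobalGlimpse(nn.Module):
    """Scores each frame's pooled feature against the question embedding (assumed design)."""

    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, frame_feats: torch.Tensor, question_feat: torch.Tensor) -> torch.Tensor:
        # frame_feats: (num_frames, dim); question_feat: (dim,)
        fused = frame_feats * question_feat          # simple question-conditioned fusion
        return self.scorer(fused).squeeze(-1)        # (num_frames,) relevance scores


def memory_pointer_prompting(frame_feats, question_feat, glimpse, top_k=8):
    """Global glimpse: rank all frames; fallback: return the key frames for answer generation."""
    scores = glimpse(frame_feats, question_feat)
    k = min(top_k, frame_feats.shape[0])
    key_idx = torch.topk(scores, k=k).indices.sort().values  # keep temporal order
    return frame_feats[key_idx], key_idx


if __name__ == "__main__":
    dim, num_frames = 256, 1200   # e.g. features from a long egocentric video
    glimpse = GlobalGlimpse(dim)
    frames = torch.randn(num_frames, dim)
    question = torch.randn(dim)
    key_frames, idx = memory_pointer_prompting(frames, question, glimpse)
    print(key_frames.shape, idx.tolist())
```

The key design point the abstract emphasizes is that the expensive, fine-grained reasoning only runs over the small set of frames flagged in the glimpse step, which is what allows hour-long videos to fit the LLM's context.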