EgoLife: Towards Egocentric Life Assistant
March 5, 2025
作者: Jingkang Yang, Shuai Liu, Hongming Guo, Yuhao Dong, Xiamengwei Zhang, Sicheng Zhang, Pengyun Wang, Zitang Zhou, Binzhu Xie, Ziyue Wang, Bei Ouyang, Zhengyu Lin, Marco Cominelli, Zhongang Cai, Yuanhan Zhang, Peiyuan Zhang, Fangzhou Hong, Joerg Widmer, Francesco Gringoli, Lei Yang, Bo Li, Ziwei Liu
cs.AI
Abstract
We introduce EgoLife, a project to develop an egocentric life assistant that
accompanies users and enhances personal efficiency through AI-powered wearable
glasses. To lay the foundation for this assistant, we conducted a comprehensive
data collection study where six participants lived together for one week,
continuously recording their daily activities - including discussions,
shopping, cooking, socializing, and entertainment - using AI glasses for
multimodal egocentric video capture, along with synchronized third-person-view
video references. This effort resulted in the EgoLife Dataset, a comprehensive
300-hour egocentric, interpersonal, multiview, and multimodal daily life
dataset with intensive annotation. Leveraging this dataset, we introduce
EgoLifeQA, a suite of long-context, life-oriented question-answering tasks
designed to provide meaningful assistance in daily life by addressing practical
questions such as recalling past relevant events, monitoring health habits, and
offering personalized recommendations. To address the key technical challenges
of (1) developing robust visual-audio models for egocentric data, (2) enabling
identity recognition, and (3) facilitating long-context question answering over
extensive temporal information, we introduce EgoButler, an integrated system
comprising EgoGPT and EgoRAG. EgoGPT is an omni-modal model trained on
egocentric datasets, achieving state-of-the-art performance on egocentric video
understanding. EgoRAG is a retrieval-based component that supports answering
ultra-long-context questions. Our experimental studies verify their working
mechanisms and reveal critical factors and bottlenecks, guiding future
improvements. By releasing our datasets, models, and benchmarks, we aim to
stimulate further research in egocentric AI assistants.
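
The abstract describes the EgoRAG component only at a high level. As a rough illustration of how a retrieval-based component can support ultra-long-context question answering over a week of recordings, the sketch below indexes time-stamped clip captions (as an EgoGPT-style captioner might produce), ranks them by embedding similarity to the question, and hands only the top matches to a language model. All names here (MemoryEntry, embed, llm, retrieve) are hypothetical placeholders; the paper's actual interfaces and retrieval strategy are not specified in this abstract.

from dataclasses import dataclass
from typing import Callable, List

import numpy as np


@dataclass
class MemoryEntry:
    # One time-stamped caption for a short clip, e.g. produced by an
    # EgoGPT-style captioner (hypothetical interface).
    timestamp: float       # seconds since the recording started
    caption: str
    embedding: np.ndarray  # text embedding of the caption


def retrieve(question: str, memory: List[MemoryEntry],
             embed: Callable[[str], np.ndarray],
             top_k: int = 5) -> List[MemoryEntry]:
    # Rank memory entries by cosine similarity to the question embedding.
    q = embed(question)
    q = q / np.linalg.norm(q)
    scored = sorted(
        memory,
        key=lambda e: float(np.dot(q, e.embedding / np.linalg.norm(e.embedding))),
        reverse=True,
    )
    return scored[:top_k]


def answer(question: str, memory: List[MemoryEntry],
           embed: Callable[[str], np.ndarray],
           llm: Callable[[str], str]) -> str:
    # Answer from retrieved evidence only, so the language model never has
    # to ingest the full week-long recording at once.
    evidence = retrieve(question, memory, embed)
    context = "\n".join(f"[t={e.timestamp:.0f}s] {e.caption}" for e in evidence)
    return llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

One appeal of such a two-stage design is that captions can be computed once per clip at recording time, while questions arrive much later; per-question cost then scales with retrieval rather than with the full 300-hour dataset.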