MobileEgo Anywhere: Open Infrastructure for Long-Horizon Egocentric Data on Commodity Hardware

Abstract

Recent advances in Vision-Language-Action (VLA) models have driven a critical demand for large-scale egocentric datasets. However, existing datasets are often limited to short episodes, typically only a few minutes long, and so fail to capture the long-horizon temporal dependencies that complex robotic task execution requires. To bridge this gap, we present MobileEgo Anywhere, a framework for collecting robust, hour-plus egocentric trajectories using commodity mobile hardware. We leverage the sensor suites that are ubiquitous in modern smartphones to provide high-fidelity, long-term camera pose tracking, effectively removing the high hardware barriers associated with traditional robotics data collection. Our contributions are threefold: (1) we release a novel dataset comprising 200 hours of diverse, long-form egocentric data with persistent state tracking; (2) we open-source a mobile application that enables any user to record egocentric data; and (3) we provide a comprehensive processing pipeline that converts raw mobile captures into standardized, training-ready formats for VLA and foundation model research. By democratizing the data collection process, this work enables massive-scale acquisition of long-horizon data across varied global environments, accelerating the development of generalizable robotic policies.