UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation
August 2, 2025
Authors: Chaitanya Patel, Hiroki Nakamura, Yuta Kyuragi, Kazuki Kozuka, Juan Carlos Niebles, Ehsan Adeli
cs.AI
Abstract
Egocentric human motion generation and forecasting with scene context are
crucial for enhancing AR/VR experiences, improving human-robot interaction,
advancing assistive technologies, and enabling adaptive healthcare solutions by
accurately predicting and simulating movement from a first-person perspective.
However, existing methods primarily focus on third-person motion synthesis with
structured 3D scene contexts, limiting their effectiveness in real-world
egocentric settings where limited field of view, frequent occlusions, and
dynamic cameras hinder scene perception. To bridge this gap, we introduce
Egocentric Motion Generation and Egocentric Motion Forecasting, two novel tasks
that utilize first-person images for scene-aware motion synthesis without
relying on an explicit 3D scene. We propose UniEgoMotion, a unified conditional
motion diffusion model with a novel head-centric motion representation tailored
for egocentric devices. UniEgoMotion's simple yet effective design supports
egocentric motion reconstruction, forecasting, and generation from first-person
visual inputs in a unified framework. Unlike previous works that overlook scene
semantics, our model effectively extracts image-based scene context to infer
plausible 3D motion. To facilitate training, we introduce EE4D-Motion, a
large-scale dataset derived from EgoExo4D, augmented with pseudo-ground-truth
3D motion annotations. UniEgoMotion achieves state-of-the-art performance in
egocentric motion reconstruction and is the first to generate motion from a
single egocentric image. Extensive evaluations demonstrate the effectiveness of
our unified framework, setting a new benchmark for egocentric motion modeling
and unlocking new possibilities for egocentric applications.
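
To make the unified framing more concrete, the toy sketch below (not the authors' implementation; all module names, dimensions, the MLP backbone, and the noise schedule are illustrative assumptions) shows how a single conditional motion diffusion denoiser could cover reconstruction, forecasting, and generation simply by changing which frames' first-person image features it is allowed to condition on.

# Illustrative sketch only, not the authors' code: a toy conditional motion
# diffusion denoiser in which reconstruction, forecasting, and generation
# differ only in which frames' image features are observed. All names,
# dimensions, the MLP backbone, and the noise schedule are assumptions.
import torch
import torch.nn as nn

T, D_MOTION, D_IMG = 16, 12, 64   # frames, per-frame motion dim, image-feature dim (assumed)

class ToyUnifiedDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        # Per-frame input: noisy motion + (masked) image feature + observation flag + timestep
        self.net = nn.Sequential(
            nn.Linear(D_MOTION + D_IMG + 2, 128),
            nn.SiLU(),
            nn.Linear(128, D_MOTION),
        )

    def forward(self, x_noisy, t, img_feats, img_mask):
        # x_noisy:   (B, T, D_MOTION) noised motion sequence
        # t:         (B,)             diffusion timestep
        # img_feats: (B, T, D_IMG)    per-frame first-person image features
        # img_mask:  (B, T, 1)        1 = image observed for this frame, 0 = not
        B = x_noisy.shape[0]
        ts = t.float()[:, None, None].expand(B, T, 1)
        inp = torch.cat([x_noisy, img_feats * img_mask, img_mask, ts], dim=-1)
        return self.net(inp)          # predicts the clean motion x0

def make_image_mask(task, batch_size):
    """Which frames' image features the model may condition on, per task."""
    m = torch.zeros(batch_size, T, 1)
    if task == "reconstruction":
        m[:] = 1.0                    # image features for every frame of the sequence
    elif task == "forecasting":
        m[:, : T // 2] = 1.0          # only past frames' images; future motion is predicted
    elif task == "generation":
        m[:, 0] = 1.0                 # a single initial egocentric image
    return m

# One simplified DDPM-style training step with x0 prediction.
model = ToyUnifiedDenoiser()
x0 = torch.randn(4, T, D_MOTION)      # stand-in (pseudo-)ground-truth motion
img_feats = torch.randn(4, T, D_IMG)  # stand-in image-encoder outputs
t = torch.randint(0, 1000, (4,))
alpha_bar = torch.rand(4, 1, 1)       # stand-in value of the noise schedule at t
noise = torch.randn_like(x0)
x_noisy = alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * noise

mask = make_image_mask("forecasting", 4)
pred_x0 = model(x_noisy, t, img_feats, mask)
loss = ((pred_x0 - x0) ** 2).mean()
loss.backward()
print(f"toy denoising loss: {loss.item():.4f}")

The only point of the sketch is the masking pattern: one network and one training loop serve three tasks, distinguished solely by the observation mask, which mirrors the unified conditional design described in the abstract.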