UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation

Abstract

Egocentric human motion generation and forecasting with scene context are crucial for enhancing AR/VR experiences, improving human-robot interaction, advancing assistive technologies, and enabling adaptive healthcare solutions, because these applications depend on accurately predicting and simulating movement from a first-person perspective. However, existing methods primarily focus on third-person motion synthesis with structured 3D scene contexts, which limits their effectiveness in real-world egocentric settings where a limited field of view, frequent occlusions, and dynamic cameras hinder scene perception. To bridge this gap, we introduce Egocentric Motion Generation and Egocentric Motion Forecasting, two novel tasks that use first-person images for scene-aware motion synthesis without relying on an explicit 3D scene. We propose UniEgoMotion, a unified conditional motion diffusion model with a novel head-centric motion representation tailored for egocentric devices. UniEgoMotion's simple yet effective design supports egocentric motion reconstruction, forecasting, and generation from first-person visual inputs in a unified framework. Unlike previous works that overlook scene semantics, our model effectively extracts image-based scene context to infer plausible 3D motion. To facilitate training, we introduce EE4D-Motion, a large-scale dataset derived from EgoExo4D and augmented with pseudo-ground-truth 3D motion annotations. UniEgoMotion achieves state-of-the-art performance in egocentric motion reconstruction and is the first to generate motion from a single egocentric image. Extensive evaluations demonstrate the effectiveness of our unified framework, setting a new benchmark for egocentric motion modeling and unlocking new possibilities for egocentric applications.
