UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation

Abstract

Egocentric human motion generation and forecasting with scene context is crucial for enhancing AR/VR experiences, improving human-robot interaction, advancing assistive technologies, and enabling adaptive healthcare solutions, because these applications depend on accurately predicting and simulating movement from a first-person perspective. However, existing methods primarily focus on third-person motion synthesis with structured 3D scene contexts, which limits their effectiveness in real-world egocentric settings where a limited field of view, frequent occlusions, and dynamic cameras hinder scene perception. To bridge this gap, we introduce Egocentric Motion Generation and Egocentric Motion Forecasting, two novel tasks that use first-person images for scene-aware motion synthesis without relying on an explicit 3D scene. We propose UniEgoMotion, a unified conditional motion diffusion model with a novel head-centric motion representation tailored for egocentric devices. UniEgoMotion's simple yet effective design supports egocentric motion reconstruction, forecasting, and generation from first-person visual inputs in a unified framework. Unlike previous works that overlook scene semantics, our model effectively extracts image-based scene context to infer plausible 3D motion. To facilitate training, we introduce EE4D-Motion, a large-scale dataset derived from EgoExo4D and augmented with pseudo-ground-truth 3D motion annotations. UniEgoMotion achieves state-of-the-art performance in egocentric motion reconstruction and is the first to generate motion from a single egocentric image. Extensive evaluations demonstrate the effectiveness of our unified framework, setting a new benchmark for egocentric motion modeling and unlocking new possibilities for egocentric applications.

