EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents

Abstract

Human behaviors in the real world naturally encode rich, long-term contextual information that can be leveraged to train embodied agents for perception, understanding, and action. However, existing capture systems typically rely on costly studio setups and wearable devices, which limits the large-scale collection of scene-conditioned human motion data in the wild. To address this, we propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones. Our key idea is to jointly calibrate dual RGB-D sequences to reconstruct both humans and scenes within a unified metric world coordinate frame. The proposed method allows metric-scale and scene-consistent capture in everyday environments without static cameras or markers, bridging human motion and scene geometry seamlessly. Evaluating against optical capture ground truth, we demonstrate that the dual-view setting mitigates depth ambiguity, achieving better alignment and reconstruction performance than single-iPhone or monocular models. Based on the collected data, we support three embodied AI tasks: monocular human-scene reconstruction, where we fine-tune feedforward models that output metric-scale, world-space aligned humans and scenes; physics-based character animation, where we show that our data can be used to scale human-object interaction skills and scene-aware motion tracking; and robot motion control, where we train a humanoid robot via sim-to-real RL to replicate human motions depicted in videos. Experimental results validate the effectiveness of our pipeline and its contributions towards advancing embodied AI research.
