ChatPaper.ai

EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents

February 26, 2026
作者: Wenjia Wang, Liang Pan, Huaijin Pi, Yuke Lou, Xuqian Ren, Yifan Wu, Zhouyingcheng Liao, Lei Yang, Rishabh Dabral, Christian Theobalt, Taku Komura
cs.AI

Abstract

Human behaviors in the real world naturally encode rich, long-term contextual information that can be leveraged to train embodied agents for perception, understanding, and acting. However, existing capture systems typically rely on costly studio setups and wearable devices, limiting the large-scale collection of scene-conditioned human motion data in the wild. To address this, we propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones. Our key idea is to jointly calibrate dual RGB-D sequences so that both humans and scenes are reconstructed within a unified metric world coordinate frame. The proposed method enables metric-scale, scene-consistent capture in everyday environments without static cameras or markers, seamlessly bridging human motion and scene geometry. Compared against optical motion-capture ground truth, we demonstrate that the dual-view setting substantially mitigates depth ambiguity, achieving superior alignment and reconstruction performance over a single iPhone or monocular models. Based on the collected data, we empower three embodied AI tasks: monocular human-scene reconstruction, where we fine-tune feedforward models that output metric-scale, world-space-aligned humans and scenes; physics-based character animation, where we show our data can be used to scale human-object interaction skills and scene-aware motion tracking; and robot motion control, where we train a humanoid robot via sim-to-real RL to replicate human motions depicted in videos. Experimental results validate the effectiveness of our pipeline and its contributions toward advancing embodied AI research.
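The core of the pipeline is joint calibration of the two RGB-D streams into one metric world frame. The paper does not give implementation details here, but the essential alignment step can be sketched as estimating a rigid transform between the two camera coordinate frames from corresponding 3D points (e.g., shared scene landmarks back-projected with depth). The sketch below uses the standard Kabsch/Procrustes solution; the function name and use of raw point correspondences are illustrative assumptions, not the authors' actual method.

```python
import numpy as np

def rigid_align(src, dst):
    """Estimate a rigid transform (R, t) such that dst ≈ src @ R.T + t.

    src, dst: (N, 3) arrays of corresponding 3D points, e.g. the same
    scene landmarks observed in each iPhone's depth-camera frame.
    Uses the Kabsch algorithm (SVD of the cross-covariance matrix).
    """
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - src_mean, dst - dst_mean
    # Cross-covariance between the centered point sets.
    H = src_c.T @ dst_c
    U, _, Vt = np.linalg.svd(H)
    # Correct for a possible reflection so det(R) = +1.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t
```

With such a transform, points captured by the second iPhone can be mapped into the first iPhone's (metric) world frame, which is the sense in which the two sequences share a unified coordinate system.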