GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors

Abstract

Scaling humanoid loco-manipulation requires robot-compatible demonstrations across diverse objects, whole-body motions, and scene geometries, but teleoperation and motion capture are difficult to scale because each collection depends on physical setups, instrumented actors, and robot operation. We present GRAIL, a digital generation pipeline that remains fully virtual until deployment: it composes 3D assets, simulator-ready scenes, and priors from video foundation models (VFMs) to synthesize interactions without rebuilding physical environments or teleoperating the robot. Rather than reconstructing directly from unconstrained in-the-wild videos, GRAIL starts from fully specified 3D configurations in which object geometry, camera parameters, metric scale, environment depth, and a robot-proportioned character are known before video generation and reused during reconstruction. This privileged setup better conditions 4D recovery, allowing model-based object tracking, human motion estimation, and interaction-aware optimization to reconstruct metric 4D human-object interaction (HOI) trajectories with reduced depth ambiguity and morphology mismatch. We retarget the recovered motions to a humanoid robot and train complementary task-general trackers: an object-aware latent adaptor for manipulation and a scene-aware tracker for terrain traversal. GRAIL produces over 20,000 sequences spanning pick-up, object manipulation, sitting, and terrain traversal. Using only GRAIL-generated data, we train egocentric visual policies through a sim-to-real pipeline and deploy them on a Unitree G1 humanoid, achieving 84% real-world success on diverse object pick-up and 90% success on stair-climbing.
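
As a minimal illustration of why known camera parameters and metric depth reduce depth ambiguity, the sketch below uses a standard pinhole camera model. It is not GRAIL's implementation: the function names and the intrinsics values (`FX`, `FY`, `CX`, `CY`) are hypothetical and chosen only for the example. Two 3D points at different depths project to the same pixel, so a pixel alone cannot fix the 3D position; once the metric depth is known, the back-projection is unique.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with a known metric depth (meters) to a camera-frame 3D point."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def project(point, fx, fy, cx, cy):
    """Project a camera-frame 3D point to pixel coordinates (pinhole model)."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


# Hypothetical intrinsics, assumed for illustration only.
FX, FY = 500.0, 500.0
CX, CY = 320.0, 240.0

# The same pixel lifted with two different depth hypotheses.
near = backproject(420.0, 240.0, 1.0, FX, FY, CX, CY)
far = backproject(420.0, 240.0, 2.0, FX, FY, CX, CY)

# Both candidates reproject to the same pixel: the image alone is ambiguous.
pixel_near = project(near, FX, FY, CX, CY)
pixel_far = project(far, FX, FY, CX, CY)
```

Here `near` and `far` are distinct 3D points that share one image location, which is the ambiguity an unconstrained video leaves open. Supplying the depth up front, as a fully specified 3D configuration does, selects a single point.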