GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors

Abstract

Scaling humanoid loco-manipulation requires robot-compatible demonstrations across diverse objects, whole-body motions, and scene geometries, but teleoperation and motion capture are difficult to scale because each collection depends on physical setups, instrumented actors, and robot operation. We present GRAIL, a digital generation pipeline that remains fully virtual until deployment: it composes 3D assets, simulator-ready scenes, and priors from video foundation models (VFMs) to synthesize interactions without rebuilding physical environments or teleoperating the robot. Rather than reconstructing unconstrained in-the-wild videos, GRAIL starts from fully specified 3D configurations in which object geometry, camera parameters, metric scale, environment depth, and a robot-proportioned character are known before video generation and reused during reconstruction. This privileged setup better conditions 4D recovery, allowing model-based object tracking, human motion estimation, and interaction-aware optimization to reconstruct metric 4D human-object interaction (HOI) trajectories with reduced depth ambiguity and morphology mismatch. We retarget the recovered motions to a humanoid robot and train complementary task-general trackers: an object-aware latent adaptor for manipulation and a scene-aware tracker for terrain traversal. GRAIL produces over 20,000 sequences spanning pick-up, object manipulation, sitting, and terrain traversal. Using only GRAIL-generated data, we train egocentric visual policies through a sim-to-real pipeline and deploy them on a Unitree G1 humanoid, achieving 84% real-world success on diverse object pick-up and 90% success on stair-climbing.
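
To illustrate why known camera parameters and environment depth can reduce depth ambiguity, the following is a minimal sketch of standard pinhole back-projection. It is not GRAIL's implementation: the function names and the intrinsics values (FX, FY, CX, CY) are hypothetical and chosen only for illustration. The sketch shows that a single pixel is consistent with many 3D points along its viewing ray, and that a known metric depth selects exactly one of them.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) to a 3D point in the camera frame (pinhole model).

    depth is the metric distance along the optical axis, in meters.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def project(point, fx, fy, cx, cy):
    """Project a 3D camera-frame point back to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


# Hypothetical intrinsics (assumed values, not taken from the paper).
FX = FY = 500.0
CX, CY = 320.0, 240.0

pixel = (420.0, 240.0)

# Without depth, the pixel only constrains a ray: these two candidate
# points are at different distances but land on the same pixel.
near = backproject(pixel[0], pixel[1], 1.0, FX, FY, CX, CY)
far = backproject(pixel[0], pixel[1], 2.0, FX, FY, CX, CY)

near_px = project(near, FX, FY, CX, CY)
far_px = project(far, FX, FY, CX, CY)
```

Here `near` and `far` differ in metric position yet reproject to the same pixel, so image evidence alone cannot tell them apart. Supplying the depth, which the abstract says is known before video generation, removes that one-parameter ambiguity.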