GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors

June 3, 2026
Authors: Tianyi Xie, Haotian Zhang, Jinhyung Park, Zi Wang, Bowen Wen, Jiefeng Li, Xueting Li, Qingwei Ben, Haoyang Weng, Yufei Ye, David Minor, Tingwu Wang, Chenfanfu Jiang, Sanja Fidler, Jan Kautz, Linxi Fan, Yuke Zhu, Zhengyi Luo, Umar Iqbal, Ye Yuan
cs.AI

Abstract

Scaling humanoid loco-manipulation requires robot-compatible demonstrations across diverse objects, whole-body motions, and scene geometries, but teleoperation and motion capture are difficult to scale because each collection depends on physical setups, instrumented actors, and robot operation. We present GRAIL, a digital generation pipeline that remains fully virtual until deployment: it composes 3D assets, simulator-ready scenes, and priors from video foundation models (VFMs) to synthesize interactions without rebuilding physical environments or teleoperating the robot. Rather than reconstructing unconstrained in-the-wild videos, GRAIL starts from fully specified 3D configurations in which object geometry, camera parameters, metric scale, environment depth, and a robot-proportioned character are known before video generation and reused during reconstruction. This privileged setup better conditions 4D recovery, allowing model-based object tracking, human motion estimation, and interaction-aware optimization to reconstruct metric 4D human-object interaction (HOI) trajectories with reduced depth ambiguity and morphology mismatch. We retarget the recovered motions to a humanoid robot and train complementary task-general trackers: an object-aware latent adaptor for manipulation and a scene-aware tracker for terrain traversal. GRAIL produces over 20,000 sequences spanning pick-up, object manipulation, sitting, and terrain traversal. Using only GRAIL-generated data, we train egocentric visual policies through a sim-to-real pipeline and deploy them on a Unitree G1 humanoid, achieving 84% real-world success on diverse object pick-up and 90% success on stair-climbing.
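
The abstract credits known camera parameters and environment depth with reducing depth ambiguity. The following is a minimal sketch of that general idea using a standard pinhole camera model; it is not GRAIL's implementation. The names `Intrinsics`, `project`, and `backproject`, and all numeric values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intrinsics:
    """Hypothetical pinhole camera parameters, in pixels."""
    fx: float
    fy: float
    cx: float
    cy: float


def project(point, k):
    """Project a camera-frame 3D point (meters) to pixel coordinates."""
    x, y, z = point
    return (k.fx * x / z + k.cx, k.fy * y / z + k.cy)


def backproject(u, v, depth_m, k):
    """Lift a pixel to a metric 3D point, given its depth along the optical axis."""
    x = (u - k.cx) / k.fx * depth_m
    y = (v - k.cy) / k.fy * depth_m
    return (x, y, depth_m)


# Assumed example values, not taken from the paper.
K = Intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)

# Two points on the same camera ray, one twice as far away as the other.
near = (0.4, 0.0, 2.0)
far = (0.8, 0.0, 4.0)

# Both land on the same pixel: the image alone cannot tell them apart.
pixel_near = project(near, K)
pixel_far = project(far, K)

# With depth known at that pixel, the metric 3D point is determined.
recovered = backproject(pixel_near[0], pixel_near[1], 2.0, K)
```

Here `near` and `far` project to the same pixel, which is the ambiguity a monocular video leaves open. Supplying the depth at that pixel recovers `near` in meters, which illustrates why a setup that specifies camera parameters and scene depth in advance gives the reconstruction a better-conditioned problem.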