GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors

Abstract

Scaling humanoid loco-manipulation requires robot-compatible demonstrations across diverse objects, whole-body motions, and scene geometries, but teleoperation and motion capture are difficult to scale because each collection depends on physical setups, instrumented actors, and robot operation. We present GRAIL, a digital generation pipeline that remains fully virtual until deployment: it composes 3D assets, simulator-ready scenes, and priors from video foundation models (VFMs) to synthesize interactions without rebuilding physical environments or teleoperating the robot. Rather than reconstructing unconstrained in-the-wild videos, GRAIL starts from fully specified 3D configurations in which object geometry, camera parameters, metric scale, environment depth, and a robot-proportioned character are known before video generation and reused during reconstruction. This privileged setup better conditions 4D recovery, allowing model-based object tracking, human motion estimation, and interaction-aware optimization to reconstruct metric 4D human-object interaction (HOI) trajectories with reduced depth ambiguity and morphology mismatch. We retarget the recovered motions to a humanoid robot and train complementary task-general trackers: an object-aware latent adaptor for manipulation and a scene-aware tracker for terrain traversal. GRAIL produces over 20,000 sequences spanning pick-up, object manipulation, sitting, and terrain traversal. Using only GRAIL-generated data, we train egocentric visual policies through a sim-to-real pipeline and deploy them on a Unitree G1 humanoid, achieving 84% real-world success on diverse object pick-up and 90% success on stair-climbing.
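The claim that known camera parameters and environment depth reduce depth ambiguity can be illustrated with a standard pinhole camera model. The sketch below is not GRAIL's implementation; the names `Intrinsics`, `project`, and `backproject` and all numeric values are hypothetical, chosen only for illustration. It shows that two 3D points at different depths can land on the same pixel, and that once metric depth is known the pixel back-projects to a single metric 3D point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera intrinsics in pixels (hypothetical values used below)."""
    fx: float
    fy: float
    cx: float
    cy: float


def project(K, point):
    """Project a camera-frame 3D point (metres) to pixel coordinates."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera")
    return (K.fx * x / z + K.cx, K.fy * y / z + K.cy)


def backproject(K, pixel, depth):
    """Recover the camera-frame 3D point from a pixel and its metric depth."""
    u, v = pixel
    return ((u - K.cx) * depth / K.fx, (v - K.cy) * depth / K.fy, depth)


K = Intrinsics(fx=600.0, fy=600.0, cx=320.0, cy=240.0)

# Two points on the same viewing ray: the second is twice as far and twice as large.
near = (0.25, 0.125, 1.0)
far = (0.5, 0.25, 2.0)

# Without depth, the image alone cannot tell them apart.
pixel_near = project(K, near)
pixel_far = project(K, far)

# With known metric depth, the same pixel resolves to one 3D point.
recovered = backproject(K, pixel_far, depth=2.0)
```

Here `pixel_near` and `pixel_far` coincide, while `recovered` equals `far`. In a setup where depth and intrinsics are available before video generation, this ambiguity does not have to be resolved by the reconstruction step.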