GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors

Abstract

Scaling humanoid loco-manipulation requires robot-compatible demonstrations across diverse objects, whole-body motions, and scene geometries, but teleoperation and motion capture are difficult to scale because each collection depends on physical setups, instrumented actors, and robot operation. We present GRAIL, a digital generation pipeline that remains fully virtual until deployment: it composes 3D assets, simulator-ready scenes, and priors from video foundation models (VFMs) to synthesize interactions without rebuilding physical environments or teleoperating the robot. Rather than reconstructing unconstrained in-the-wild videos, GRAIL starts from fully specified 3D configurations in which object geometry, camera parameters, metric scale, environment depth, and a robot-proportioned character are known before video generation and reused during reconstruction. This privileged setup better conditions 4D recovery, allowing model-based object tracking, human motion estimation, and interaction-aware optimization to reconstruct metric 4D human-object interaction (HOI) trajectories with reduced depth ambiguity and morphology mismatch. We retarget the recovered motions to a humanoid robot and train complementary task-general trackers: an object-aware latent adaptor for manipulation and a scene-aware tracker for terrain traversal. GRAIL produces over 20,000 sequences spanning pick-up, object manipulation, sitting, and terrain traversal. Using only GRAIL-generated data, we train egocentric visual policies through a sim-to-real pipeline and deploy them on a Unitree G1 humanoid, achieving 84% real-world success on diverse object pick-up and 90% success on stair-climbing.
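
To illustrate why known camera parameters and metric depth help condition 4D recovery, the following is a minimal sketch of standard pinhole back-projection. It is not GRAIL's implementation: the function names `backproject` and `project` and the intrinsics values are hypothetical, chosen only for illustration. The sketch shows that a single pixel is consistent with many 3D points along one camera ray, and that a known metric depth selects exactly one of them.

```python
# Illustrative sketch only (not the authors' code): standard pinhole camera model.
# All names and numeric values below are assumptions made for this example.

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with a metric depth (meters) to a camera-frame 3D point."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def project(point, fx, fy, cx, cy):
    """Project a camera-frame 3D point back to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


# Hypothetical intrinsics for a 640x480 image.
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0

# The same pixel observed at two candidate depths gives two different 3D points...
near = backproject(400.0, 300.0, 1.0, FX, FY, CX, CY)
far = backproject(400.0, 300.0, 2.0, FX, FY, CX, CY)

# ...that reproject to the same pixel, which is the depth ambiguity of a single view.
pixel_near = project(near, FX, FY, CX, CY)
pixel_far = project(far, FX, FY, CX, CY)

print("near point:", near)
print("far point:", far)
print("reprojections:", pixel_near, pixel_far)
```

Without depth, `near` and `far` are indistinguishable from the image alone. When the depth map and intrinsics are known in advance, as in the fully specified configurations the abstract describes, the 3D position of each pixel is fixed in metric units.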