X-Sim: Cross-Embodiment Learning via Real-to-Sim-to-Real

Abstract

Human videos offer a scalable way to train robot manipulation policies, but lack the action labels needed by standard imitation learning algorithms. Existing cross-embodiment approaches try to map human motion to robot actions, but often fail when the embodiments differ significantly. We propose X-Sim, a real-to-sim-to-real framework that uses object motion as a dense and transferable signal for learning robot policies. X-Sim starts by reconstructing a photorealistic simulation from an RGBD human video and tracking object trajectories to define object-centric rewards. These rewards are used to train a reinforcement learning (RL) policy in simulation. The learned policy is then distilled into an image-conditioned diffusion policy using synthetic rollouts rendered with varied viewpoints and lighting. To transfer to the real world, X-Sim introduces an online domain adaptation technique that aligns real and simulated observations during deployment. Importantly, X-Sim does not require any robot teleoperation data. We evaluate it across 5 manipulation tasks in 2 environments and show that it: (1) improves task progress by 30% on average over hand-tracking and sim-to-real baselines, (2) matches behavior cloning with 10x less data collection time, and (3) generalizes to new camera viewpoints and test-time changes. Code and videos are available at https://portal-cornell.github.io/X-Sim/.
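
As a rough illustration of what an object-centric reward could look like, the sketch below scores a simulated object position against the position tracked from the human video at the same time step. It is not the authors' implementation: the function names, the exponential shaping, the `scale` parameter, and the position-only comparison (orientation is ignored) are all assumptions made for this example.

```python
import math
from typing import List, Sequence

# Hypothetical representation: an object position is an (x, y, z) tuple in metres.
Position = Sequence[float]


def object_tracking_reward(sim_pos: Position, target_pos: Position,
                           scale: float = 10.0) -> float:
    """Reward in (0, 1] that is 1.0 when the simulated object is on the target.

    The exponential shaping and the default scale are assumptions chosen
    for illustration, not values taken from the paper.
    """
    distance = math.dist(sim_pos, target_pos)  # Euclidean distance
    return math.exp(-scale * distance)


def trajectory_rewards(sim_traj: Sequence[Position],
                       demo_traj: Sequence[Position],
                       scale: float = 10.0) -> List[float]:
    """Per-step rewards comparing a simulated rollout with the tracked
    object trajectory from the human video (assumed to be time-aligned)."""
    if len(sim_traj) != len(demo_traj):
        raise ValueError("trajectories must have the same number of steps")
    return [object_tracking_reward(s, d, scale)
            for s, d in zip(sim_traj, demo_traj)]
```

Because such a reward depends only on where the object is, and not on how a hand or gripper moved it, the same signal can in principle be reused across embodiments, which is the property the abstract describes as "transferable".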
