DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

Abstract

The ability to simulate the outcomes of actions in varied environments would transform the development of generalist agents at scale. However, modeling these world dynamics, especially for dexterous robotics tasks, poses significant challenges due to limited data coverage and scarce action labels. To this end, we introduce DreamDojo, a foundation world model that learns diverse interactions and dexterous controls from 44k hours of egocentric human videos. Our data mixture represents the largest video dataset to date for world model pretraining, spanning a wide range of daily scenarios with diverse objects and skills. To address the scarcity of action labels, we introduce continuous latent actions as unified proxy actions, enhancing interaction knowledge transfer from unlabeled videos. After post-training on small-scale target robot data, DreamDojo demonstrates a strong understanding of physics and precise action controllability. We also devise a distillation pipeline that accelerates DreamDojo to a real-time speed of 10.81 FPS and further improves context consistency. Our work enables several important applications based on generative world models, including live teleoperation, policy evaluation, and model-based planning. Systematic evaluation on multiple challenging out-of-distribution (OOD) benchmarks verifies the significance of our method for simulating open-world, contact-rich tasks, paving the way for general-purpose robot world models.
