Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

Abstract

Imitation learning has emerged as a promising approach to building generalist robots. However, scaling imitation learning to large robot foundation models remains challenging because it relies on high-quality expert demonstrations. Meanwhile, large amounts of video data depicting a wide range of environments and diverse behaviors are readily available. This data is a rich source of information about real-world dynamics and agent-environment interactions. Leveraging it directly for imitation learning, however, has proven difficult because it lacks the action annotations that most contemporary methods require. In this work, we present Unified World Models (UWM), a framework that leverages both video and action data for policy learning. Specifically, a UWM integrates an action diffusion process and a video diffusion process within a unified transformer architecture, where each modality is governed by its own independent diffusion timestep. We show that, simply by controlling each diffusion timestep, UWM can flexibly represent a policy, a forward dynamics model, an inverse dynamics model, and a video generator. Through simulated and real-world experiments, we show that: (1) UWM enables effective pretraining on large-scale multitask robot datasets with both dynamics and action predictions, resulting in more generalizable and robust policies than imitation learning, and (2) UWM naturally facilitates learning from action-free video data through independent control of modality-specific diffusion timesteps, further improving the performance of finetuned policies. Our results suggest that UWM offers a promising step toward harnessing large, heterogeneous datasets for scalable robot learning, and provides a simple unification of the often disparate paradigms of imitation learning and world modeling. Videos and code are available at https://weirdlabuw.github.io/uwm/.
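The timestep-control idea can be illustrated with a small sketch. This is not the authors' implementation: the names `T_MAX`, `Role`, `MODES`, `timestep_for`, and `schedule` are hypothetical, and the mapping from each mode to per-modality timesteps is an assumed reading of the abstract. In this reading, a modality being generated is denoised from the maximum timestep downward, a modality used as conditioning is held clean at timestep 0, and a modality to be ignored is held at pure noise at the maximum timestep.

```python
from enum import Enum

T_MAX = 1000  # hypothetical number of diffusion steps (assumption)


class Role(Enum):
    GENERATE = "generate"        # denoised from T_MAX down toward 0
    CONDITION = "condition"      # held clean, timestep fixed at 0
    MARGINALIZE = "marginalize"  # held at pure noise, timestep fixed at T_MAX


# Assumed mapping from inference mode to (action role, video role).
MODES = {
    "policy": (Role.GENERATE, Role.MARGINALIZE),
    "video_generator": (Role.MARGINALIZE, Role.GENERATE),
    "forward_dynamics": (Role.CONDITION, Role.GENERATE),
    "inverse_dynamics": (Role.GENERATE, Role.CONDITION),
}


def timestep_for(role, step):
    """Timestep fed to the model for one modality at sampling step `step`."""
    if role is Role.GENERATE:
        return step
    if role is Role.CONDITION:
        return 0
    return T_MAX


def schedule(mode, num_steps=4):
    """List the (t_action, t_video) pairs visited by a coarse sampling loop."""
    action_role, video_role = MODES[mode]
    stride = T_MAX // num_steps
    return [
        (timestep_for(action_role, s), timestep_for(video_role, s))
        for s in range(T_MAX, 0, -stride)
    ]


if __name__ == "__main__":
    for mode in MODES:
        print(mode, schedule(mode))
```

Under the same assumption, an action-free video clip could be used for training by treating the action modality as marginalized (timestep fixed at `T_MAX`), so that only the video branch receives a learning signal.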
