MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction

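Abstract

World models, which predict how the environment changes in response to actions, are essential for autonomous driving models with strong generalization. Today's mainstream driving world models are built mostly on video prediction models. These models can generate high-resolution video sequences with advanced diffusion-based generators, but they are limited in prediction horizon and in overall generalization ability. This paper addresses the problem by combining a generation loss with MAE (Masked Autoencoder)-style feature-level context learning. Three main designs realize this goal: (1) a more scalable Diffusion Transformer (DiT) structure trained with an additional mask construction task; (2) diffusion-related mask tokens designed to handle the ambiguous relation between mask reconstruction and the generative diffusion process; and (3) an extension of the mask construction task to the spatial-temporal domain, using a row-wise mask for shifted self-attention instead of the masked self-attention of MAE. A row-wise cross-view module is also adopted to match this mask design. Building on these improvements, we propose MaskGWM, a generalizable driving world model that implements video mask reconstruction. The model comes in two variants: MaskGWM-long, which focuses on long-horizon prediction, and MaskGWM-mview, which is dedicated to multi-view generation. Comprehensive experiments on standard benchmarks validate the proposed method, including standard validation on the nuScenes dataset, long-horizon rollout on the OpenDV-2K dataset, and zero-shot validation on the Waymo dataset. Quantitative metrics on these datasets show that our method substantially improves on state-of-the-art driving world models.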

English

World models that forecast environmental changes from actions are vital for autonomous driving models with strong generalization. Prevailing driving world models mainly build on video prediction models. Although these models can produce high-fidelity video sequences with advanced diffusion-based generators, they are constrained by their predictive duration and overall generalization capabilities. In this paper, we address this problem by combining the generation loss with MAE-style feature-level context learning. In particular, we instantiate this goal with three key designs: (1) a more scalable Diffusion Transformer (DiT) structure trained with an extra mask construction task; (2) diffusion-related mask tokens that handle the fuzzy relation between mask reconstruction and the generative diffusion process; and (3) an extension of the mask construction task to the spatial-temporal domain, using a row-wise mask for shifted self-attention rather than the masked self-attention of MAE. We then adopt a row-wise cross-view module to align with this mask design. Based on these improvements, we propose MaskGWM: a Generalizable driving World Model embodied with Video Mask reconstruction. Our model has two variants: MaskGWM-long, focusing on long-horizon prediction, and MaskGWM-mview, dedicated to multi-view generation. Comprehensive experiments on standard benchmarks validate the effectiveness of the proposed method, covering standard validation on the nuScenes dataset, long-horizon rollout on the OpenDV-2K dataset, and zero-shot validation on the Waymo dataset. Quantitative metrics on these datasets show that our method notably improves on state-of-the-art driving world models.
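The abstract does not spell out how the row-wise mask or the combined objective is built, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes that "row-wise" means every row of a frame's token grid keeps the same number of visible tokens, so the visible tokens still form a rectangular grid, and that the generation and reconstruction losses are combined as a weighted sum. The names `row_wise_mask`, `combined_loss`, `keep_per_row`, and `weight` are hypothetical.

```python
import random


def row_wise_mask(num_frames, height, width, keep_per_row, seed=0):
    """Return mask[t][r][c], True where the token is masked (hidden).

    Assumption (not from the paper): each row of each frame keeps exactly
    `keep_per_row` visible tokens, so the visible tokens of a frame can
    still be arranged as a height x keep_per_row grid.
    """
    if not 0 < keep_per_row <= width:
        raise ValueError("keep_per_row must be in 1..width")
    rng = random.Random(seed)
    mask = []
    for _ in range(num_frames):
        frame = []
        for _ in range(height):
            # Pick which columns stay visible in this row.
            visible = set(rng.sample(range(width), keep_per_row))
            frame.append([c not in visible for c in range(width)])
        mask.append(frame)
    return mask


def combined_loss(generation_loss, reconstruction_loss, weight=1.0):
    """Weighted sum of the two objectives; `weight` is a hypothetical knob."""
    return generation_loss + weight * reconstruction_loss


mask = row_wise_mask(num_frames=2, height=3, width=8, keep_per_row=2)
# Count visible (unmasked) tokens in every row of every frame.
visible_per_row = [row.count(False) for frame in mask for row in frame]
```

Under this reading, keeping the visible count per row constant is what lets the unmasked tokens be reshaped into a regular grid, which window-based attention generally requires.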
