MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction
February 17, 2025
作者: Jingcheng Ni, Yuxin Guo, Yichen Liu, Rui Chen, Lewei Lu, Zehuan Wu
cs.AI
Abstract
World models that forecast environmental changes from actions are vital for
autonomous driving models with strong generalization. Prevailing driving
world models are mainly built on video prediction models. Although these models
can produce high-fidelity video sequences with advanced diffusion-based
generators, they are constrained in their predictive duration and overall
generalization capability. In this paper, we explore solving this problem by
combining generation loss with MAE-style feature-level context learning. In
particular, we instantiate this target with three key designs: (1) a more
scalable Diffusion Transformer (DiT) structure trained with an extra
mask-construction task; (2) diffusion-related mask tokens that handle the fuzzy
relation between mask reconstruction and the generative diffusion process; and
(3) an extension of the mask-construction task to the spatial-temporal domain
using a row-wise mask for shifted self-attention, rather than the masked
self-attention used in MAE. We then adopt a row-wise cross-view module to align
with this mask design. Based on these improvements, we propose MaskGWM: a
Generalizable driving World Model embodied with Video Mask reconstruction. Our
model has two variants: MaskGWM-long, focusing on long-horizon prediction, and
MaskGWM-mview, dedicated to multi-view generation. Comprehensive experiments on
standard benchmarks validate the effectiveness of the proposed method,
including standard validation on the nuScenes dataset, long-horizon rollout on
the OpenDV-2K dataset, and zero-shot validation on the Waymo dataset.
Quantitative metrics on these datasets show that our method notably improves on
state-of-the-art driving world models.
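To make the row-wise masking idea concrete, the sketch below masks whole token rows of a video token grid per frame, the structural pattern the abstract contrasts with MAE's random patch masking. This is a minimal NumPy illustration under stated assumptions: the function name, the zero-vector stand-in for a learnable mask-token embedding, and the exact per-frame sampling scheme are all hypothetical, not the authors' released implementation.

```python
import numpy as np

def row_wise_mask(tokens, mask_ratio=0.5, rng=None):
    """Mask entire token rows per frame (illustrative sketch).

    tokens: array of shape (T, H, W, D) -- T frames, an H x W grid of
    D-dimensional tokens per frame. Masked rows are replaced with a shared
    mask token; here a zero vector stands in for a learnable embedding.
    Returns the masked tokens and a (T, H) boolean mask of dropped rows.
    """
    rng = np.random.default_rng(rng)
    T, H, W, D = tokens.shape
    n_mask = int(H * mask_ratio)
    mask_token = np.zeros(D)  # placeholder for a learnable mask embedding
    out = tokens.copy()
    mask = np.zeros((T, H), dtype=bool)
    for t in range(T):
        # sample which rows of this frame to hide, without replacement
        rows = rng.choice(H, size=n_mask, replace=False)
        out[t, rows] = mask_token
        mask[t, rows] = True
    return out, mask

# toy video: 4 frames, an 8x8 token grid, 16-dim tokens
tokens = np.ones((4, 8, 8, 16))
masked, mask = row_wise_mask(tokens, mask_ratio=0.5, rng=0)
```

Because each hidden region spans a full row, the visible tokens can still attend along rows via shifted self-attention instead of the masked self-attention used in MAE, which is the property the paper's row-wise design exploits.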