MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction

Abstract

World models that forecast environmental changes from actions are vital for autonomous driving models with strong generalization. Prevailing driving world models are mainly built on video prediction models. Although these models can produce high-fidelity video sequences with advanced diffusion-based generators, they are constrained by their prediction horizon and overall generalization capability. In this paper, we explore solving this problem by combining the generation loss with MAE-style feature-level context learning. In particular, we instantiate this goal with three key designs: (1) a more scalable Diffusion Transformer (DiT) structure trained with an extra mask construction task; (2) diffusion-related mask tokens that handle the fuzzy relation between mask reconstruction and the generative diffusion process; and (3) an extension of the mask construction task to the spatial-temporal domain, using a row-wise mask for shifted self-attention rather than the masked self-attention of MAE. We then adopt a row-wise cross-view module to align with this mask design. Building on these improvements, we propose MaskGWM: a Generalizable driving World Model embodied with Video Mask reconstruction. Our model has two variants: MaskGWM-long, which focuses on long-horizon prediction, and MaskGWM-mview, which is dedicated to multi-view generation. Comprehensive experiments on standard benchmarks validate the effectiveness of the proposed method, including standard validation on the nuScenes dataset, long-horizon rollout on the OpenDV-2K dataset, and zero-shot validation on the Waymo dataset. Quantitative metrics on these datasets show that our method notably improves on state-of-the-art driving world models.
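The row-wise masking idea in design (3) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the masking ratio, and the choice to sample rows independently for each frame are assumptions made for illustration. The sketch only shows why dropping whole rows of patch tokens leaves every frame with a rectangular grid of visible tokens, which is the shape a windowed or shifted attention layer expects.

```python
import random


def row_wise_mask(num_frames, height, mask_ratio, seed=0):
    """Pick, for every frame, which rows of patch tokens stay visible.

    Hypothetical helper: whole rows are dropped, so each frame keeps the
    same number of visible rows and the visible tokens stay rectangular.
    """
    rng = random.Random(seed)
    num_keep = max(1, round(height * (1.0 - mask_ratio)))
    return [sorted(rng.sample(range(height), num_keep)) for _ in range(num_frames)]


def apply_row_mask(tokens, kept_rows):
    """Select the visible rows from tokens indexed as tokens[frame][row][col]."""
    return [[frame[r] for r in rows] for frame, rows in zip(tokens, kept_rows)]


# Toy video: 3 frames, each a 4 x 5 grid of patch tokens (tuples stand in
# for feature vectors).
tokens = [[[(t, h, w) for w in range(5)] for h in range(4)] for t in range(3)]

kept = row_wise_mask(num_frames=3, height=4, mask_ratio=0.5)
visible = apply_row_mask(tokens, kept)

# With mask_ratio=0.5, each frame keeps a 2 x 5 grid of visible tokens.
for frame in visible:
    print(len(frame), "x", len(frame[0]))
```

Under these assumptions, the masked rows would be the targets of the reconstruction task, while the visible grid is passed to the transformer blocks unchanged in layout.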
