Discrete-WAM: Unified Discrete Vision-Action Token Editing for World-Policy Learning

Abstract

Autonomous driving requires reasoning about how ego actions shape the evolution of the surrounding world. However, most end-to-end methods rely on direct state-to-action mappings, capturing correlations without explicitly modeling action-conditioned dynamics. Conversely, continuous-latent world models often lack compositional structure for causal reasoning across counterfactual futures. We introduce Discrete-WAM, a unified latent vision-action world policy that represents future visual states and ego actions as aligned discrete tokens, enabling compositional causal reasoning across alternative futures. Built upon this unified discrete alignment, Discrete-WAM establishes a shared discrete diffusion framework with unified generative tasks that jointly formulates world modeling, world-action policy, and hierarchical decision-enabled policy, and supports compositional generalization across diverse driving scenarios. Experiments on large-scale autonomous-driving benchmarks show that Discrete-WAM achieves competitive performance while supporting controllable generation and counterfactual reasoning, offering a principled path toward more reliable decision-making.