Discrete-WAM: Unified Discrete Vision-Action Token Editing for World-Policy Learning

Abstract

Autonomous driving requires reasoning about how ego actions shape the evolution of the surrounding world. However, most end-to-end methods rely on direct state-to-action mappings, capturing correlations without explicitly modeling action-conditioned dynamics. Conversely, continuous-latent world models often lack compositional structure for causal reasoning across counterfactual futures. We introduce Discrete-WAM, a unified latent vision-action world policy that represents future visual states and ego actions as aligned discrete tokens, enabling compositional causal reasoning across alternative futures. Built upon this unified discrete alignment, Discrete-WAM establishes a shared discrete diffusion framework with unified generative tasks, jointly formulating world modeling, world-action policy, and hierarchical decision-enabled policy, supporting compositional generalization across diverse driving scenarios. Experiments on large-scale autonomous-driving benchmarks show that Discrete-WAM achieves competitive performance while supporting controllable generation and counterfactual reasoning, offering a principled path toward more reliable decision-making.
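
To make the idea of "unified generative tasks" over aligned discrete tokens more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single sequence laid out as future visual tokens followed by ego action tokens, and a mask-based (absorbing-state) form of discrete diffusion in which each task is defined only by which positions are masked. The names `MASK`, `task_mask`, and `corrupt`, the task labels, and the sequence layout are all hypothetical; the hierarchical decision-enabled policy is omitted because the abstract does not describe its structure.

```python
MASK = -1  # hypothetical id for the masked (absorbing) token


def task_mask(task, n_visual, n_action):
    """Return per-position flags: True = to be generated, False = given as condition.

    Assumed layout: [future visual tokens | ego action tokens].
    """
    if task == "world_model":
        # Actions are given; future visual tokens are generated.
        return [True] * n_visual + [False] * n_action
    if task == "world_action_policy":
        # Future visual tokens and actions are generated jointly.
        return [True] * (n_visual + n_action)
    raise ValueError(f"unknown task: {task}")


def corrupt(visual_tokens, action_tokens, task):
    """Replace every position the task must generate with MASK."""
    seq = list(visual_tokens) + list(action_tokens)
    mask = task_mask(task, len(visual_tokens), len(action_tokens))
    return [MASK if m else tok for tok, m in zip(seq, mask)]


if __name__ == "__main__":
    visual = [5, 6, 7]   # toy visual token ids
    actions = [1, 2]     # toy action token ids
    print(corrupt(visual, actions, "world_model"))
    print(corrupt(visual, actions, "world_action_policy"))
```

Under these assumptions, a counterfactual query amounts to running the `world_model` task again with a different set of action tokens while keeping the same context, so that one denoiser could in principle serve both prediction and planning.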