Discrete-WAM: Unified Discrete Vision-Action Token Editing for World Policy Learning

Abstract

Autonomous driving requires reasoning about how ego actions shape the evolution of the surrounding world. Most end-to-end methods, however, rely on direct state-to-action mappings, which capture correlations without explicitly modeling action-conditioned world dynamics. Continuous-latent world models, in turn, often lack the compositional structure needed for causal reasoning across counterfactual futures. We introduce Discrete-WAM, a unified latent vision-action world policy that represents future visual states and ego actions as aligned discrete tokens, enabling compositional causal reasoning across multiple alternative futures. Building on this unified discrete alignment, Discrete-WAM establishes a shared discrete diffusion framework with unified generative tasks that jointly formulate world modeling, a world-action policy, and a hierarchical decision policy, supporting compositional generalization across diverse driving scenarios. Experiments on large-scale autonomous-driving benchmarks show that Discrete-WAM achieves competitive performance while supporting controllable generation and counterfactual reasoning, offering a principled path toward more reliable decision-making.