Fast LeWorldModel

Abstract

Joint-Embedding Predictive Architectures (JEPAs), including the recent LeWorldModel (LeWM), have become a promising foundation for reconstruction-free visual world models. For visual planning, however, LeWM evaluates candidate action sequences by repeatedly applying a local one-step latent transition model. This autoregressive rollout makes planning computationally expensive, and latent errors accumulate along the predicted trajectory as the horizon grows. We propose Fast LeWorldModel (Fast-LeWM), a fast latent world model that replaces repeated local rollout with action-prefix prediction. Given the current latent and a candidate action sequence, Fast-LeWM encodes the prefixes of that sequence and predicts, in parallel, the future latent reached after executing each prefix. By making action prefixes the basic prediction unit, Fast-LeWM directly models the cumulative effect of actions over multiple horizons. This prefix-level supervision forces the model to learn how states evolve continuously under different action prefixes, rather than only fitting one-step state transitions. During planning, the predictor can use the last prefix token of the encoded action sequence to evaluate the corresponding future latent without explicitly rolling through each intermediate imagined state. Across multiple tasks, Fast-LeWM improves average success rate over LeWM while substantially reducing planning time. It also achieves lower open-loop latent loss, and that loss grows significantly more slowly as the rollout horizon increases.
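The structural difference between autoregressive rollout and action-prefix prediction can be illustrated with a minimal NumPy sketch. It is not the Fast-LeWM implementation: the weights are random and untrained, the cumulative sum merely stands in for some causal action encoder, and all names (`rollout_autoregressive`, `encode_prefixes`, `predict_from_prefixes`, the dimensions) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, ACTION_DIM, HORIZON = 8, 2, 5

# Hypothetical, untrained weights; a real model would learn these.
W_step = rng.normal(scale=0.1, size=(LATENT_DIM + ACTION_DIM, LATENT_DIM))
W_act = rng.normal(scale=0.1, size=(ACTION_DIM, LATENT_DIM))
W_pred = rng.normal(scale=0.1, size=(2 * LATENT_DIM, LATENT_DIM))


def rollout_autoregressive(z0, actions):
    """Baseline: apply a one-step transition once per action, in sequence."""
    z, latents = z0, []
    for a in actions:  # step t needs the output of step t-1
        z = np.tanh(np.concatenate([z, a]) @ W_step)
        latents.append(z)
    return np.stack(latents)  # shape (H, LATENT_DIM)


def encode_prefixes(actions):
    """Causal prefix encoder: token k depends only on actions[0..k].

    The cumulative sum is an assumed stand-in for a learned causal encoder.
    """
    return np.cumsum(actions @ W_act, axis=0)  # shape (H, LATENT_DIM)


def predict_from_prefixes(z0, prefix_tokens):
    """Predict the latent reached after each prefix in one batched call."""
    z0_tiled = np.broadcast_to(z0, prefix_tokens.shape)
    features = np.concatenate([z0_tiled, prefix_tokens], axis=-1)
    return np.tanh(features @ W_pred)


z0 = rng.normal(size=LATENT_DIM)
actions = rng.normal(size=(HORIZON, ACTION_DIM))

ar_latents = rollout_autoregressive(z0, actions)          # H sequential steps
tokens = encode_prefixes(actions)
prefix_latents = predict_from_prefixes(z0, tokens)        # all prefixes at once
final_latent = predict_from_prefixes(z0, tokens[-1:])[0]  # last prefix only
```

In this sketch, predicting all prefixes corresponds to the prefix-level supervision used in training, while `final_latent` corresponds to the planning case, where only the last prefix token is needed and no intermediate imagined state is computed.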