Fast LeWorldModel

Abstract

Joint-Embedding Predictive Architectures (JEPAs), including the recent LeWorldModel (LeWM), have become a promising foundation for reconstruction-free visual world models. For visual planning, however, LeWM evaluates candidate action sequences by repeatedly applying a local one-step latent transition model. This autoregressive rollout makes planning computationally expensive and exposes the predicted trajectory to latent errors that accumulate as the horizon grows. We propose Fast LeWorldModel (Fast-LeWM), a fast latent world model that replaces repeated local rollout with action-prefix prediction. Given the current latent and a candidate action sequence, Fast-LeWM encodes the sequence's prefixes and predicts, in parallel, the future latents reached after executing each prefix. By making action prefixes the basic prediction unit, Fast-LeWM directly models the effects of actions accumulated to different extents across multiple horizons. This prefix-level supervision forces the model to learn how states evolve continuously under different action prefixes, rather than only fitting one-step state transitions. During planning, the predictor can use the last prefix token of the encoded action sequence to evaluate the corresponding future latent without explicitly rolling through each intermediate imagined state. Across multiple tasks, Fast-LeWM improves average success over LeWM while substantially reducing planning time. It also achieves lower open-loop latent loss, and that loss grows significantly more slowly as the rollout horizon increases.
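
The contrast between autoregressive rollout and action-prefix prediction can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the networks, so the toy linear-plus-tanh maps, the cumulative-sum prefix encoder, the dimensions, and all names (one_step, encode_prefixes, predict_from_prefixes, plan_score) are assumptions chosen only to show the data flow. The weights are random and untrained, so the two predictors are not expected to agree numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
D, A, H = 8, 2, 5  # latent dim, action dim, horizon (hypothetical sizes)

# Random, untrained stand-in weights; the real model's networks are assumptions here.
W_z = rng.normal(scale=0.3, size=(D, D))
W_a = rng.normal(scale=0.3, size=(A, D))
W_p = rng.normal(scale=0.3, size=(A, D))        # toy prefix-encoder weights
W_out = rng.normal(scale=0.3, size=(2 * D, D))  # toy prefix-predictor weights


def one_step(z, a):
    """Toy local one-step latent transition."""
    return np.tanh(z @ W_z + a @ W_a)


def autoregressive_rollout(z0, actions):
    """Baseline: H sequential calls, each fed the previous prediction."""
    z, out = z0, []
    for a in actions:
        z = one_step(z, a)
        out.append(z)
    return np.stack(out)  # (H, D)


def encode_prefixes(actions):
    """Toy causal prefix encoder: token k summarizes actions[0..k]."""
    return np.cumsum(actions @ W_p, axis=0)  # (H, D)


def predict_from_prefixes(z0, prefix_tokens):
    """Predict the latent reached after each prefix in one batched call."""
    z_rep = np.broadcast_to(z0, prefix_tokens.shape)
    return np.tanh(np.concatenate([z_rep, prefix_tokens], axis=-1) @ W_out)


def plan_score(z0, actions, z_goal):
    """Planning cost from the last prefix token only (no intermediate states)."""
    last_token = encode_prefixes(actions)[-1:]
    z_final = predict_from_prefixes(z0, last_token)[0]
    return float(np.sum((z_final - z_goal) ** 2))


z0 = rng.normal(size=D)
actions = rng.normal(size=(H, A))
z_goal = rng.normal(size=D)

rollout_latents = autoregressive_rollout(z0, actions)                 # sequential
prefix_latents = predict_from_prefixes(z0, encode_prefixes(actions))  # parallel
cost = plan_score(z0, actions, z_goal)
```

In this sketch, the prefix predictor never consumes its own outputs, so an error at horizon k cannot feed into the prediction at horizon k+1, and scoring a candidate sequence needs a single predictor call instead of H. During training, every row of prefix_latents could be supervised against the encoded future observation at the matching step, which is one way to read the prefix-level supervision described above.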