Light-WAM: Efficient World Action Models with State-Fusion Action Decoding

Abstract

World Action Models (WAMs) extend robot policy learning by adding future prediction as an auxiliary training objective, which encourages the policy to encode task-relevant temporal structure in its representations. Current WAMs often rely on large-scale generative architectures that incur high training costs and inference latency, making them difficult to deploy as efficient closed-loop policies. We propose Light-WAM, a lightweight World Action Model for efficient robot manipulation. It is built on a compact video backbone and performs future-video supervision in a downsampled latent space, reducing the cost of video co-training while retaining its benefits for representation learning. For action prediction, Light-WAM introduces the StateFusionActionExpert, which reads adapted states from multiple backbone layers, fuses them through learned-query pooling, and directly predicts action chunks in a single forward pass. This design provides an efficient interface between video backbone representations and robot actions, avoiding the need for heavy generative action experts. Experiments demonstrate that Light-WAM maintains strong performance on LIBERO and achieves usable multi-task performance on RoboTwin 2.0, while using only 0.44B trainable parameters. It also achieves an inference latency of 72.03 ms with a peak GPU memory of 4.1 GiB, along with improved training throughput.
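
To make the action-decoding path concrete, the following is a minimal sketch of learned-query pooling over multi-layer backbone states, followed by direct action-chunk prediction in one pass. It is an illustration under stated assumptions, not the Light-WAM implementation: the class name StateFusionSketch, the per-layer linear adapters, the sharing of queries across layers, the linear output head, and all dimensions are hypothetical choices that the abstract does not specify. Weights are random stand-ins for learned parameters.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rng, rows, cols, scale=0.1):
    """Random matrix standing in for learned weights or backbone states."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def learned_query_pool(tokens, queries):
    """Attention-pool N tokens (N x d) into K vectors (K x d) using K queries."""
    d = len(queries[0])
    keys_t = [list(col) for col in zip(*tokens)]   # d x N
    scores = matmul(queries, keys_t)                # K x N
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, tokens)                  # K x d


class StateFusionSketch:
    """Hypothetical stand-in for the action expert described in the abstract."""

    def __init__(self, num_layers, backbone_dim, fused_dim,
                 num_queries, horizon, action_dim, seed=0):
        rng = random.Random(seed)
        # Assumption: one linear adapter per tapped backbone layer.
        self.adapters = [rand_matrix(rng, backbone_dim, fused_dim)
                         for _ in range(num_layers)]
        # Assumption: the same query vectors are shared across layers.
        self.queries = rand_matrix(rng, num_queries, fused_dim)
        # Assumption: a single linear head maps the fused state to the chunk.
        self.head = rand_matrix(rng, num_layers * num_queries * fused_dim,
                                horizon * action_dim)
        self.horizon = horizon
        self.action_dim = action_dim

    def predict(self, layer_states):
        """Map per-layer token states to a (horizon x action_dim) action chunk."""
        fused = []
        for states, adapter in zip(layer_states, self.adapters):
            adapted = matmul(states, adapter)                   # N x fused_dim
            pooled = learned_query_pool(adapted, self.queries)  # K x fused_dim
            fused.extend(v for row in pooled for v in row)
        flat = matmul([fused], self.head)[0]
        # No iterative sampling: the whole chunk comes from one forward pass.
        return [flat[t * self.action_dim:(t + 1) * self.action_dim]
                for t in range(self.horizon)]


# Usage: three tapped layers, six tokens each, 8-dimensional backbone states.
rng = random.Random(1)
layer_states = [rand_matrix(rng, 6, 8, scale=1.0) for _ in range(3)]
expert = StateFusionSketch(num_layers=3, backbone_dim=8, fused_dim=4,
                           num_queries=2, horizon=5, action_dim=7)
chunk = expert.predict(layer_states)
print(len(chunk), len(chunk[0]))  # prints "5 7"
```

The point of the sketch is the cost profile: pooling reduces each layer's token sequence to a fixed number of vectors, so the head operates on a small fused state and emits the full action chunk without the repeated denoising or autoregressive steps that a generative action expert would need.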