Light-WAM: An Efficient World Action Model with State-Fusion Action Decoding

Abstract

World Action Models (WAMs) extend robot policy learning by adding future prediction as an auxiliary training objective, which encourages the policy to encode task-relevant temporal structure in its representations. Current WAMs often rely on large-scale generative architectures that incur high training costs and inference latency, making them difficult to deploy as efficient closed-loop policies. We propose Light-WAM, a lightweight World Action Model for efficient robot manipulation. Specifically, it uses a compact video backbone and applies future-video supervision in a downsampled latent space, reducing the cost of video co-training while retaining its benefits for representation learning. For action prediction, Light-WAM introduces the StateFusionActionExpert, which reads adapted states from multiple backbone layers, fuses them through learned-query pooling, and directly predicts action chunks in a single forward pass. This design provides an efficient interface between video backbone representations and robot actions, avoiding the need for heavy generative action experts. Experiments demonstrate that Light-WAM maintains strong performance on LIBERO and achieves usable multi-task performance on RoboTwin 2.0, while using only 0.44B trainable parameters. It also achieves an inference latency of 72.03 ms with a peak GPU memory footprint of 4.1 GiB, and improves training throughput.
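The action-decoding path described above (adapt states from several backbone layers, fuse them with learned-query pooling, predict an action chunk in one pass) can be illustrated with a minimal sketch. This is not the authors' implementation. The class name `StateFusionSketch`, the per-layer linear adapters, the softmax attention used for pooling, the linear action head, and all dimensions are assumptions chosen for illustration; the abstract does not specify any of them. Weights are random, so the output values are meaningless and only the data flow and shapes matter.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Random matrix standing in for trained weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class StateFusionSketch:
    """Hypothetical stand-in for a state-fusion action expert."""

    def __init__(self, num_layers, state_dim, hidden_dim,
                 num_queries, chunk_len, action_dim, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        self.chunk_len = chunk_len
        self.action_dim = action_dim
        # Assumption: one linear adapter per tapped backbone layer.
        self.adapters = [rand_matrix(hidden_dim, state_dim, rng)
                         for _ in range(num_layers)]
        # Learned queries that pool over all adapted tokens.
        self.queries = rand_matrix(num_queries, hidden_dim, rng)
        # Assumption: a single linear head maps pooled features to the chunk.
        self.head = rand_matrix(chunk_len * action_dim,
                                num_queries * hidden_dim, rng)

    def pool_weights(self, tokens):
        """Attention weights of each learned query over the tokens."""
        scale = math.sqrt(self.hidden_dim)
        return [softmax([sum(a * b for a, b in zip(q, t)) / scale
                         for t in tokens])
                for q in self.queries]

    def forward(self, layer_states):
        """layer_states[l][t] is the state vector of token t at layer l."""
        assert len(layer_states) == len(self.adapters)
        # 1) Adapt every token of every tapped layer into a shared space.
        tokens = [matvec(W, s)
                  for W, states in zip(self.adapters, layer_states)
                  for s in states]
        # 2) Fuse: each query takes a weighted average of all tokens.
        pooled = []
        for w in self.pool_weights(tokens):
            for d in range(self.hidden_dim):
                pooled.append(sum(wi * t[d] for wi, t in zip(w, tokens)))
        # 3) Predict the whole action chunk directly, with no iterative sampling.
        flat = matvec(self.head, pooled)
        return [flat[i * self.action_dim:(i + 1) * self.action_dim]
                for i in range(self.chunk_len)]


# Toy usage: 3 tapped layers, 5 tokens each, 8-dim states (all made up).
rng = random.Random(1)
layer_states = [[[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
                for _ in range(3)]
expert = StateFusionSketch(num_layers=3, state_dim=8, hidden_dim=4,
                           num_queries=2, chunk_len=6, action_dim=7)
chunk = expert.forward(layer_states)
print(len(chunk), len(chunk[0]))  # 6 7
```

The point of the sketch is the cost profile: the chunk comes out of one deterministic pass through small projections, whereas a generative action expert would typically need several sampling or denoising steps per chunk.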