Light-WAM: Efficient World Action Models with State-Fusion Action Decoding

June 6, 2026
Authors: Ziang Li, Dongzhou Cheng, Yibin Wang, Shiyue Wang, Xiaoyang Xu, Lingxuan Weng, Juan Wang, Jiaqi Wang
cs.AI

Abstract

World Action Models (WAMs) extend robot policy learning by incorporating future prediction as an additional training objective, encouraging the policy to encode task-relevant temporal structure in its representations. Current WAMs often rely on large-scale generative architectures that incur high training costs and inference latency, making them difficult to deploy as efficient closed-loop policies. We propose Light-WAM, a lightweight World Action Model for efficient robot manipulation. Specifically, it is built with a compact video backbone and performs future-video supervision in a downsampled latent space, reducing the cost of video co-training while retaining its benefits for representation learning. For action prediction, Light-WAM introduces the StateFusionActionExpert, which reads adapted states from multiple backbone layers, fuses them through learned-query pooling, and directly predicts action chunks in a single forward pass. This design provides an efficient interface between video backbone representations and robot actions, avoiding the need for heavy generative action experts. Experiments demonstrate that Light-WAM maintains strong performance on LIBERO and achieves usable multi-task performance on RoboTwin 2.0, while using only 0.44B trainable parameters. It also achieves 72.03 ms inference latency with 4.1 GiB peak GPU memory and improved training throughput.
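
The abstract describes the StateFusionActionExpert only at a high level, so the following is a minimal NumPy sketch of one way such a module could be wired, not the authors' implementation. The class name `StateFusionSketch`, the per-layer linear adapters, the single-head dot-product attention used for learned-query pooling, the linear action head, and every dimension are assumptions chosen for illustration; weights are random rather than trained.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


class StateFusionSketch:
    """Hypothetical stand-in for a state-fusion action decoder (illustrative only)."""

    def __init__(self, num_layers, hidden_dim, fused_dim,
                 num_queries, chunk_len, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Assumption: one linear adapter per tapped backbone layer.
        self.adapters = [rng.normal(0.0, 0.02, (hidden_dim, fused_dim))
                         for _ in range(num_layers)]
        # Assumption: a small set of learned query vectors pools the states.
        self.queries = rng.normal(0.0, 0.02, (num_queries, fused_dim))
        # Assumption: a single linear head maps pooled features to the chunk.
        self.head = rng.normal(0.0, 0.02,
                               (num_queries * fused_dim, chunk_len * action_dim))
        self.chunk_len = chunk_len
        self.action_dim = action_dim

    def __call__(self, layer_states):
        """layer_states: list of arrays shaped (batch, tokens, hidden_dim)."""
        if len(layer_states) != len(self.adapters):
            raise ValueError("expected one hidden state per tapped layer")
        # Adapt each layer's states, then stack all tokens into one memory.
        adapted = [h @ w for h, w in zip(layer_states, self.adapters)]
        memory = np.concatenate(adapted, axis=1)            # (B, L*T, F)
        # Learned-query pooling: each query attends over the whole memory.
        scores = self.queries @ memory.transpose(0, 2, 1)   # (B, Q, L*T)
        weights = softmax(scores / np.sqrt(memory.shape[-1]), axis=-1)
        pooled = weights @ memory                           # (B, Q, F)
        # Predict the full action chunk directly, with no iterative sampling.
        flat = pooled.reshape(pooled.shape[0], -1)
        actions = flat @ self.head
        return actions.reshape(-1, self.chunk_len, self.action_dim)


# Usage with made-up sizes: 3 tapped layers, batch of 2, 5 tokens per layer.
model = StateFusionSketch(num_layers=3, hidden_dim=16, fused_dim=8,
                          num_queries=4, chunk_len=8, action_dim=7)
states = [np.ones((2, 5, 16)) for _ in range(3)]
action_chunk = model(states)
print(action_chunk.shape)
```

The point of the sketch is the control flow: the whole action chunk comes out of one deterministic pass over the backbone states, in contrast to a generative action expert that would typically need several sampling or denoising steps per chunk.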