Light-WAM: Efficient World Action Model via State-Fusion Action Decoding

Abstract

World Action Models (WAMs) extend robot policy learning by adding future prediction as an auxiliary training objective, which encourages the policy to encode task-relevant temporal structure in its representations. Current WAMs often rely on large generative architectures that incur high training cost and inference latency, making them difficult to deploy as efficient closed-loop policies. We propose Light-WAM, a lightweight World Action Model for efficient robot manipulation. It is built on a compact video backbone and applies future-video supervision in a downsampled latent space, reducing the cost of video co-training while retaining its benefits for representation learning. For action prediction, Light-WAM introduces the StateFusionActionExpert, which reads adapted states from multiple backbone layers, fuses them through learned-query pooling, and directly predicts action chunks in a single forward pass. This design provides an efficient interface between video backbone representations and robot actions, removing the need for a heavy generative action expert. Experiments show that Light-WAM maintains strong performance on LIBERO and achieves usable multi-task performance on RoboTwin 2.0 while using only 0.44B trainable parameters. It also reaches an inference latency of 72.03 ms with 4.1 GiB peak GPU memory, along with improved training throughput.
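
To make the action-decoding path concrete, the following is a minimal NumPy sketch of the three steps named above: adapting states from several backbone layers, fusing them with learned-query pooling, and emitting a whole action chunk in one forward pass. It is an illustration under stated assumptions, not the authors' implementation. The class name `StateFusionSketch`, the per-layer linear adapters, the single-head dot-product attention, the linear action head, and all dimensions are hypothetical choices made for this example; the weights are random rather than trained.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class StateFusionSketch:
    """Hypothetical sketch of multi-layer state fusion with learned-query pooling."""

    def __init__(self, num_layers, backbone_dim, state_dim, chunk_len, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Assumption: one linear adapter per tapped backbone layer.
        self.adapters = [
            rng.normal(0.0, 0.02, (backbone_dim, state_dim)) for _ in range(num_layers)
        ]
        # Assumption: one learned query per action step in the chunk.
        self.queries = rng.normal(0.0, 0.02, (chunk_len, state_dim))
        # Assumption: a plain linear head maps pooled states to actions.
        self.head = rng.normal(0.0, 0.02, (state_dim, action_dim))

    def __call__(self, layer_states):
        # layer_states: list of (num_tokens, backbone_dim) arrays, one per tapped layer.
        assert len(layer_states) == len(self.adapters)
        # 1) Adapt each layer's hidden states into a shared state space.
        adapted = [h @ w for h, w in zip(layer_states, self.adapters)]
        # 2) Pool over all adapted tokens with the learned queries (attention pooling).
        tokens = np.concatenate(adapted, axis=0)             # (layers * tokens, state_dim)
        scores = self.queries @ tokens.T / np.sqrt(tokens.shape[-1])
        weights = softmax(scores, axis=-1)                   # (chunk_len, layers * tokens)
        pooled = weights @ tokens                            # (chunk_len, state_dim)
        # 3) Decode the full action chunk in a single pass, with no iterative sampling.
        return pooled @ self.head, weights                   # (chunk_len, action_dim)


# Usage with random stand-ins for backbone hidden states (3 layers, 16 tokens each).
rng = np.random.default_rng(1)
states = [rng.normal(size=(16, 32)) for _ in range(3)]
expert = StateFusionSketch(num_layers=3, backbone_dim=32, state_dim=24,
                           chunk_len=8, action_dim=7)
actions, weights = expert(states)
print(actions.shape)
```

Each query attends over the tokens of every tapped layer at once, so the cost of decoding is one matrix product per step rather than a multi-step generative sampling loop, which is the efficiency argument the abstract makes for replacing a generative action expert.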