Light-WAM: An Efficient World Action Model with State-Fusion Action Decoding

Abstract

World Action Models (WAMs) extend robot policy learning by adding future prediction as an auxiliary training objective, which encourages the policy to encode task-relevant temporal structure in its representations. Current WAMs often rely on large-scale generative architectures whose training cost and inference latency are high, making them difficult to deploy as efficient closed-loop policies. We propose Light-WAM, a lightweight World Action Model for efficient robot manipulation. Light-WAM is built on a compact video backbone and applies future-video supervision in a downsampled latent space, which reduces the cost of video co-training while retaining its benefits for representation learning. For action prediction, Light-WAM introduces the StateFusionActionExpert, which reads adapted states from multiple backbone layers, fuses them through learned-query pooling, and directly predicts action chunks in a single forward pass. This design provides an efficient interface between video backbone representations and robot actions and avoids the need for a heavy generative action expert. Experiments show that Light-WAM maintains strong performance on LIBERO and achieves usable multi-task performance on RoboTwin 2.0 while using only 0.44B trainable parameters. It also achieves an inference latency of 72.03 ms with a peak GPU memory of 4.1 GiB, along with improved training throughput.
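
To make the action-decoding path concrete, the following is a minimal NumPy sketch of the three steps the abstract attributes to the StateFusionActionExpert: adapting states from several backbone layers, fusing them with learned-query pooling, and predicting an action chunk in one forward pass. It is an illustration under stated assumptions, not the authors' implementation: the class name `StateFusionSketch`, the per-layer linear adapters, the use of a single query vector, the linear output head, and all dimensions are hypothetical choices, and the weights are random rather than trained.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class StateFusionSketch:
    """Hypothetical stand-in for a state-fusion action head.

    Assumptions (not taken from the paper): one linear adapter per tapped
    backbone layer, a single learned query, and a linear action head.
    """

    def __init__(self, num_layers, hidden_dim, fused_dim, chunk_len, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        # One adapter matrix per backbone layer: (num_layers, hidden_dim, fused_dim).
        self.adapters = rng.normal(0.0, 1.0 / np.sqrt(hidden_dim),
                                   (num_layers, hidden_dim, fused_dim))
        # Learned query used to pool over all adapted tokens.
        self.query = rng.normal(0.0, 1.0 / np.sqrt(fused_dim), (fused_dim,))
        # Linear head mapping the fused state to a flattened action chunk.
        self.head = rng.normal(0.0, 1.0 / np.sqrt(fused_dim),
                               (fused_dim, chunk_len * action_dim))
        self.chunk_len = chunk_len
        self.action_dim = action_dim

    def pool_weights(self, layer_states):
        """Adapt each layer's states and score every token against the query.

        layer_states has shape (num_layers, tokens, hidden_dim).
        """
        adapted = np.einsum("lth,lhd->ltd", layer_states, self.adapters)
        tokens = adapted.reshape(-1, adapted.shape[-1])  # (num_layers * tokens, fused_dim)
        scores = tokens @ self.query / np.sqrt(tokens.shape[-1])
        return tokens, softmax(scores)

    def __call__(self, layer_states):
        tokens, weights = self.pool_weights(layer_states)
        fused = weights @ tokens       # attention-weighted sum, shape (fused_dim,)
        actions = fused @ self.head    # the whole chunk is produced in one pass
        return actions.reshape(self.chunk_len, self.action_dim)


# Usage with random stand-ins for hidden states from three backbone layers.
states = np.random.default_rng(1).normal(size=(3, 16, 32))
expert = StateFusionSketch(num_layers=3, hidden_dim=32, fused_dim=24,
                           chunk_len=8, action_dim=7)
chunk = expert(states)
print(chunk.shape)  # (8, 7)
```

The point of the sketch is the cost profile: the whole chunk comes from a single weighted sum and one matrix product, with no iterative sampling loop of the kind a generative action expert would need.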