StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation
October 6, 2025
Authors: Mingyu Liu, Jiuhe Shu, Hui Chen, Zeju Li, Canyu Zhao, Jiange Yang, Shenyuan Gao, Hao Chen, Chunhua Shen
cs.AI
Abstract
A fundamental challenge in embodied intelligence is developing expressive and
compact state representations for efficient world modeling and decision making.
However, existing methods often fail to achieve this balance, yielding
representations that are either overly redundant or lacking in task-critical
information. We propose an unsupervised approach that learns a highly
compressed two-token state representation using a lightweight encoder and a
pre-trained Diffusion Transformer (DiT) decoder, capitalizing on its strong
generative prior. Our representation is efficient, interpretable, and
integrates seamlessly into existing Vision-Language-Action (VLA) models, improving performance by
14.3% on LIBERO and 30% in real-world task success with minimal inference
overhead. More importantly, we find that the difference between these tokens,
obtained via latent interpolation, naturally serves as a highly effective
latent action, which can be further decoded into executable robot actions. This
emergent capability reveals that our representation captures structured
dynamics without explicit supervision. We name our method StaMo for its ability
to learn generalizable robotic Motion from compact State representations encoded
from static images, challenging the prevalent reliance on complex architectures
and video data for learning latent actions. The resulting latent
actions also enhance policy co-training, outperforming prior methods by 10.4%
with improved interpretability. Moreover, our approach scales effectively
across diverse data sources, including real-world robot data, simulation, and
human egocentric video.
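
To make the core idea concrete, here is a minimal sketch of the mechanism the abstract describes: a lightweight encoder compresses an image into two compact state tokens, and the difference between the tokens of two states serves as a latent action that a small head maps to executable robot actions. This is not the authors' implementation; the module names (`StateEncoder`, `latent_action`, `action_head`), the convolutional stem, the token dimension, and the 7-DoF action output are all illustrative assumptions, and the DiT-based training objective is only noted in a comment.

```python
import torch
import torch.nn as nn


class StateEncoder(nn.Module):
    """Hypothetical lightweight encoder: image -> two compact state tokens.

    In the paper, such an encoder is reportedly trained by reconstructing the
    image through a frozen pre-trained DiT decoder conditioned on the tokens,
    exploiting the DiT's generative prior; that training loop is omitted here.
    """

    def __init__(self, token_dim: int = 64):
        super().__init__()
        self.token_dim = token_dim
        self.backbone = nn.Sequential(                 # small conv stem (assumed)
            nn.Conv2d(3, 32, kernel_size=4, stride=4), nn.GELU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=4), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_tokens = nn.Linear(64, 2 * token_dim)  # project to two tokens

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W) -> tokens: (B, 2, token_dim)
        feats = self.backbone(image)
        return self.to_tokens(feats).view(-1, 2, self.token_dim)


def latent_action(tokens_t: torch.Tensor, tokens_tp1: torch.Tensor,
                  alpha: float = 1.0) -> torch.Tensor:
    """Treat the (scaled) difference between two compact state encodings as a
    latent action, following the abstract's high-level description."""
    return alpha * (tokens_tp1 - tokens_t)


# Usage sketch: encode two frames, derive a latent action, and decode it with
# a (hypothetical) action head that would be trained separately.
encoder = StateEncoder()
img_t = torch.rand(1, 3, 224, 224)
img_tp1 = torch.rand(1, 3, 224, 224)
z_t, z_tp1 = encoder(img_t), encoder(img_tp1)
a_latent = latent_action(z_t, z_tp1)                   # (1, 2, 64)
action_head = nn.Linear(2 * 64, 7)                     # e.g., 7-DoF end-effector action
robot_action = action_head(a_latent.flatten(1))        # (1, 7)
```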