

StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation

October 6, 2025
Authors: Mingyu Liu, Jiuhe Shu, Hui Chen, Zeju Li, Canyu Zhao, Jiange Yang, Shenyuan Gao, Hao Chen, Chunhua Shen
cs.AI

Abstract

A fundamental challenge in embodied intelligence is developing expressive yet compact state representations for efficient world modeling and decision making. However, existing methods often fail to achieve this balance, yielding representations that are either overly redundant or lacking in task-critical information. We propose an unsupervised approach that learns a highly compressed two-token state representation using a lightweight encoder and a pre-trained Diffusion Transformer (DiT) decoder, capitalizing on its strong generative prior. Our representation is efficient, interpretable, and integrates seamlessly into existing VLA-based models, improving performance by 14.3% on LIBERO and by 30% in real-world task success rate with minimal inference overhead. More importantly, we find that the difference between these tokens, obtained via latent interpolation, naturally serves as a highly effective latent action, which can be further decoded into executable robot actions. This emergent capability reveals that our representation captures structured dynamics without explicit supervision. We name our method StaMo for its ability to learn generalizable robotic Motion from a compact State representation encoded from static images, challenging the prevalent reliance on complex architectures and video data for learning latent actions. The resulting latent actions also enhance policy co-training, outperforming prior methods by 10.4% with improved interpretability. Moreover, our approach scales effectively across diverse data sources, including real-world robot data, simulation, and human egocentric video.
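To make the described mechanism concrete, below is a minimal, hypothetical PyTorch sketch of the idea as stated in the abstract: a lightweight encoder compresses a static image into two state tokens, the difference between the tokens of two frames is treated as a latent action, and a small head decodes it into executable robot actions. All module names, dimensions, and the action head are illustrative assumptions rather than the authors' implementation, and the paper's pre-trained DiT decoder (used for reconstruction) is omitted here.

# Minimal sketch (not the authors' code) of the idea described in the abstract.
# All modules, dimensions, and names below are hypothetical placeholders;
# the pre-trained DiT decoder is omitted and only referenced in comments.
import torch
import torch.nn as nn

class LightweightStateEncoder(nn.Module):
    """Encodes a static RGB image into a highly compressed two-token state."""
    def __init__(self, token_dim: int = 256):
        super().__init__()
        # Small convolutional stem standing in for whatever lightweight encoder is used.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=4), nn.GELU(),
            nn.Conv2d(64, 128, 4, stride=4), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_tokens = nn.Linear(128, 2 * token_dim)  # project to exactly two tokens
        self.token_dim = token_dim

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(image)                               # (B, 128)
        return self.to_tokens(feats).view(-1, 2, self.token_dim)   # (B, 2, D)
        # In the paper, a frozen pre-trained DiT decoder would reconstruct the
        # image from these tokens to provide the unsupervised training signal.

def latent_action(state_t: torch.Tensor, state_tk: torch.Tensor) -> torch.Tensor:
    """Latent action as the difference between compact state tokens of two frames."""
    return state_tk - state_t                                      # (B, 2, D)

class ActionHead(nn.Module):
    """Hypothetical small decoder mapping a latent action to executable robot actions."""
    def __init__(self, token_dim: int = 256, action_dim: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * token_dim, 512), nn.GELU(),
            nn.Linear(512, action_dim),
        )

    def forward(self, dz: torch.Tensor) -> torch.Tensor:
        return self.mlp(dz.flatten(1))                             # (B, action_dim)

# Usage: encode two static frames, take the token difference, decode an action.
encoder, head = LightweightStateEncoder(), ActionHead()
img_t, img_tk = torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224)
z_t, z_tk = encoder(img_t), encoder(img_tk)
action = head(latent_action(z_t, z_tk))                            # (1, 7)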