StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation

Abstract

A fundamental challenge in embodied intelligence is developing state representations that are both expressive and compact, so that world modeling and decision making remain efficient. Existing methods often fail to strike this balance, yielding representations that are either overly redundant or missing task-critical information. We propose an unsupervised approach that learns a highly compressed two-token state representation using a lightweight encoder and a pre-trained Diffusion Transformer (DiT) decoder, capitalizing on the decoder's strong generative prior. Our representation is efficient and interpretable, and it integrates seamlessly into existing VLA-based models, improving performance by 14.3% on LIBERO and by 30% in real-world task success with minimal inference overhead. More importantly, we find that the difference between these tokens, obtained via latent interpolation, naturally serves as a highly effective latent action that can be further decoded into executable robot actions. This emergent capability shows that our representation captures structured dynamics without explicit supervision. We name our method StaMo for its ability to learn generalizable robotic Motion from a compact State representation encoded from static images, challenging the prevailing reliance of latent action learning on complex architectures and video data. The resulting latent actions also enhance policy co-training, outperforming prior methods by 10.4% with improved interpretability. Moreover, our approach scales effectively across diverse data sources, including real-world robot data, simulation, and human egocentric video.
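To make the latent-action idea concrete, the following minimal sketch treats a latent action as the element-wise difference between two two-token state representations and uses linear interpolation to move between them. This is one plausible reading of the abstract, not the authors' implementation: the function names `latent_action` and `interpolate`, the token width of three, and the hand-written token values are all assumptions made for illustration, and the encoder and DiT decoder are not modeled at all.

```python
from typing import List

# A state is two tokens, each a small vector of floats.
# The token width (3 here) is a hypothetical choice; the abstract only fixes the token count.
Tokens = List[List[float]]


def latent_action(z_current: Tokens, z_goal: Tokens) -> Tokens:
    """Element-wise difference between two state representations (assumed form of a latent action)."""
    return [[g - c for c, g in zip(tok_c, tok_g)]
            for tok_c, tok_g in zip(z_current, z_goal)]


def interpolate(z_current: Tokens, z_goal: Tokens, alpha: float) -> Tokens:
    """Linear interpolation: alpha=0 returns z_current, alpha=1 returns z_goal."""
    delta = latent_action(z_current, z_goal)
    return [[c + alpha * d for c, d in zip(tok_c, tok_d)]
            for tok_c, tok_d in zip(z_current, delta)]


# Hand-written stand-ins for encoder outputs of a current frame and a later frame.
z_t: Tokens = [[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]]
z_next: Tokens = [[1.0, 3.0, 2.0], [0.0, 1.0, 2.0]]

action = latent_action(z_t, z_next)
midpoint = interpolate(z_t, z_next, 0.5)
```

In this toy setting, adding the latent action to the current state recovers the later state exactly, and intermediate values of `alpha` give states along the straight line between the two. Whether a learned representation behaves this linearly is an empirical question that the abstract's results speak to; the sketch only fixes the arithmetic.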
