Geometric Action Model for Robot Policy Learning

Abstract

Generalist robot policies must follow user instructions while reasoning about how objects, cameras, and robot actions interact in the 3D physical world. Recent vision-language-action models (VLAs) and video world-action models (WAMs) inherit strong semantic or temporal priors from large-scale foundation models, but they still operate primarily on 2D image frames or 2D-derived latent spaces, leaving the 3D geometry required for contact-rich manipulation implicit. We propose the Geometric Action Model (GAM), a language-conditioned manipulation policy that directly repurposes a pretrained geometric foundation model (GFM) as a shared substrate for perception, temporal prediction, and action decoding. GAM splits the GFM at an intermediate layer: the shallow layers serve as an observation encoder, and a causal future predictor inserted at the split point forecasts future latent tokens conditioned on language, proprioception, and action history. The predicted future tokens are then routed through the remaining GFM blocks for feature propagation and decoding, so that a single backbone produces both future geometry and actions. This design equips the GFM with language-conditioned temporal world modeling through minimal architectural modification while preserving its rich geometric priors. Across a broad suite of simulation and real-robot manipulation benchmarks, GAM is more accurate, more robust, faster, and lighter than current foundation-model-scale baselines.
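The data flow described in the abstract (encode with the shallow blocks, predict future latents causally at the split point, decode with the remaining blocks) can be illustrated with a toy sketch. Everything in it is a hypothetical stand-in rather than the authors' implementation: `make_toy_block` replaces a pretrained GFM block, `causal_future_predictor` is a simple running average rather than a learned model, the conditioning on language, proprioception, and action history is collapsed into a single scalar `cond`, and the geometry and action heads are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Tokens = List[float]
Block = Callable[[Tokens], Tokens]


def make_toy_block(scale: float) -> Block:
    """Hypothetical stand-in for one pretrained GFM block."""
    def block(tokens: Tokens) -> Tokens:
        mean = sum(tokens) / len(tokens)
        # Mix each token with the mean (a crude proxy for token mixing).
        return [scale * (0.5 * t + 0.5 * mean) for t in tokens]
    return block


def run_blocks(blocks: Sequence[Block], tokens: Tokens) -> Tokens:
    """Apply a sequence of blocks in order."""
    for block in blocks:
        tokens = block(tokens)
    return tokens


def causal_future_predictor(latents: Sequence[Tokens], cond: float) -> List[Tokens]:
    """Toy predictor: output t is the guess for step t+1, using steps 0..t only."""
    preds = []
    for t in range(len(latents)):
        past = latents[: t + 1]  # causal: never looks at steps after t
        width = len(past[0])
        avg = [sum(step[i] for step in past) / len(past) for i in range(width)]
        # `cond` is an assumed scalar summary of language, proprioception,
        # and action history.
        preds.append([a + cond for a in avg])
    return preds


@dataclass
class ToyGAM:
    blocks: List[Block]  # the full backbone, shared by encoder and decoder
    split: int           # index of the intermediate split layer

    def encode(self, obs: Tokens) -> Tokens:
        # Shallow blocks act as the observation encoder.
        return run_blocks(self.blocks[: self.split], obs)

    def decode(self, latent: Tokens) -> Tokens:
        # Remaining blocks propagate features from the predicted tokens.
        return run_blocks(self.blocks[self.split:], latent)

    def step(self, obs_history: Sequence[Tokens], cond: float) -> Tuple[Tokens, float]:
        latents = [self.encode(obs) for obs in obs_history]
        future_latent = causal_future_predictor(latents, cond)[-1]
        features = self.decode(future_latent)
        geometry = features                     # placeholder geometry head
        action = sum(features) / len(features)  # placeholder action head
        return geometry, action


model = ToyGAM(blocks=[make_toy_block(1.0) for _ in range(4)], split=2)
geometry, action = model.step([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]], cond=0.1)
```

The point of the sketch is structural: the same list of blocks serves both `encode` and `decode`, so the only added component is the predictor at the split point, which mirrors the "minimal architectural modification" claim.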