Geometric Action Model for Robot Policy Learning

Abstract

Generalist robot policies must follow user instructions while reasoning about how objects, cameras, and robot actions interact in the 3D physical world. Recent vision-language-action models (VLAs) and video world-action models (WAMs) inherit strong semantic or temporal priors from large-scale foundation models, but they still operate primarily on 2D image frames or 2D-derived latent spaces, leaving implicit the 3D geometry required for contact-rich manipulation. We propose the Geometric Action Model (GAM), a language-conditioned manipulation policy that directly repurposes a pretrained geometric foundation model (GFM) as a shared substrate for perception, temporal prediction, and action decoding. GAM splits the GFM at an intermediate layer: the shallow layers serve as an observation encoder, and a causal future predictor inserted at the split layer forecasts future latent tokens conditioned on language, proprioception, and action history. The predicted future tokens are then routed through the remaining GFM blocks for feature propagation and decoding, allowing a single backbone to produce both future geometry and actions. This design equips the GFM with language-conditioned temporal world modeling through minimal architectural modification while preserving its rich geometric priors. Across a broad suite of simulation and real-robot manipulation benchmarks, GAM is more accurate, more robust, faster, and lighter than current foundation-model-scale baselines.
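
The split-and-predict data flow described in the abstract can be illustrated with a small NumPy sketch. It is a toy under stated assumptions, not the authors' implementation: `ToyGAM`, `make_block`, `causal_attention`, and all dimensions are hypothetical names and values; random residual blocks stand in for pretrained GFM blocks; a single causally masked attention layer stands in for the future predictor; and one conditioning vector per timestep stands in for the embedded language, proprioception, and action history. How GAM actually injects conditioning and reads out geometry and actions is not specified in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # latent token width (hypothetical)


def make_block(d):
    """Toy residual block standing in for one pretrained GFM block (assumption)."""
    w = rng.normal(scale=0.1, size=(d, d))
    return lambda x: x + np.tanh(x @ w)


def run(blocks, x):
    """Apply a list of blocks in order."""
    for block in blocks:
        x = block(x)
    return x


def causal_attention(x):
    """Single-head self-attention over time with a causal mask. x: (T, D)."""
    t = x.shape[0]
    scores = x @ x.T / np.sqrt(x.shape[1])
    future = np.triu(np.ones((t, t), dtype=bool), k=1)  # True where key is in the future
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x


class ToyGAM:
    """Hypothetical stand-in: one backbone split into encoder and decoder halves."""

    def __init__(self, n_blocks=6, split=3, cond_dim=8, action_dim=7):
        blocks = [make_block(D) for _ in range(n_blocks)]
        self.encoder = blocks[:split]   # shallow layers: observation encoder
        self.decoder = blocks[split:]   # remaining layers: propagation and decoding
        self.w_cond = rng.normal(scale=0.1, size=(cond_dim, D))
        self.w_geom = rng.normal(scale=0.1, size=(D, 1))          # toy geometry head
        self.w_act = rng.normal(scale=0.1, size=(D, action_dim))  # toy action head

    def predict_future(self, latents, cond):
        """latents: (T, N, D) tokens per step; cond: (T, cond_dim) per step."""
        x = latents + (cond @ self.w_cond)[:, None, :]  # add conditioning to every token
        # Attend over time separately for each token slot, never looking ahead.
        mixed = np.stack(
            [causal_attention(x[:, n]) for n in range(x.shape[1])], axis=1
        )
        return latents + mixed  # row t is read as the predicted tokens for step t + 1

    def forward(self, obs_tokens, cond):
        latents = np.stack([run(self.encoder, o) for o in obs_tokens])  # (T, N, D)
        future = self.predict_future(latents, cond)
        decoded = np.stack([run(self.decoder, f) for f in future])      # (T, N, D)
        geometry = decoded @ self.w_geom              # (T, N, 1), e.g. per-token depth
        actions = decoded.mean(axis=1) @ self.w_act   # (T, action_dim)
        return geometry, actions


model = ToyGAM()
obs = rng.normal(size=(4, 5, D))  # 4 timesteps, 5 tokens per observation
cond = rng.normal(size=(4, 8))    # one conditioning vector per timestep
geometry, actions = model.forward(obs, cond)
```

Because the predictor is causally masked and both backbone halves run per timestep, perturbing the last observation in this toy leaves the outputs for earlier timesteps unchanged, which is the property a causal future predictor is meant to guarantee.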