Geometric Action Model for Robot Policy Learning

Abstract

Generalist robot policies must follow user instructions while reasoning about how objects, cameras, and robot actions interact in the 3D physical world. Recent vision-language-action models (VLAs) and video world-action models (WAMs) inherit strong semantic or temporal priors from large-scale foundation models, but they still operate primarily on 2D image frames or 2D-derived latent spaces, so the 3D geometry required for contact-rich manipulation remains implicit. We propose the Geometric Action Model (GAM), a language-conditioned manipulation policy that directly repurposes a pretrained geometric foundation model (GFM) as a shared substrate for perception, temporal prediction, and action decoding. GAM splits the GFM at an intermediate layer: the shallow layers serve as an observation encoder, and a causal future predictor inserted at the split layer forecasts future latent tokens conditioned on language, proprioception, and action history. The predicted future tokens are then routed through the remaining GFM blocks for feature propagation and decoding, allowing a single backbone to produce both future geometry and actions. This design equips the GFM with language-conditioned temporal world modeling through minimal architectural modification while preserving its rich geometric priors. Across a broad suite of simulation and real-robot manipulation benchmarks, GAM is more accurate, more robust, faster, and lighter than current foundation-model-scale baselines.
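The data flow described above (split the backbone, encode with the shallow layers, predict future tokens at the split, decode with the remaining layers) can be illustrated with a toy sketch. This is not the authors' implementation: the names `run_blocks`, `causal_predictor`, and `gam_forward` are hypothetical, each "block" is a plain function on a list of floats rather than a transformer layer, and a single scalar `condition` stands in for language, proprioception, and action history. The geometry and action heads that would read from the decoded features are omitted.

```python
from typing import Callable, List, Sequence

Tokens = List[float]                  # toy stand-in for one frame's latent tokens
Block = Callable[[Tokens], Tokens]    # toy stand-in for one backbone layer


def run_blocks(blocks: Sequence[Block], tokens: Tokens) -> Tokens:
    """Apply a stack of blocks in order."""
    for block in blocks:
        tokens = block(tokens)
    return tokens


def causal_predictor(history: List[Tokens], condition: float) -> Tokens:
    """Hypothetical predictor: forecast the next step from past steps only.

    Assumption for illustration: the forecast is the per-token mean of the
    history plus a conditioning offset.
    """
    n = len(history)
    width = len(history[0])
    return [sum(step[i] for step in history) / n + condition
            for i in range(width)]


def gam_forward(blocks: Sequence[Block], split: int, frames: List[Tokens],
                condition: float, horizon: int) -> List[Tokens]:
    """Toy version of the split-predict-decode routing."""
    shallow, deep = blocks[:split], blocks[split:]
    # 1) Shallow layers act as the observation encoder.
    history = [run_blocks(shallow, frame) for frame in frames]
    decoded = []
    for _ in range(horizon):
        # 2) Predict future latent tokens at the split layer (past steps only).
        future = causal_predictor(history, condition)
        history.append(future)
        # 3) Route the predicted tokens through the remaining blocks.
        decoded.append(run_blocks(deep, future))
    return decoded


# Usage with four toy blocks, split after the second one.
toy_blocks: List[Block] = [
    lambda t: [x + 1 for x in t],
    lambda t: [x * 2 for x in t],
    lambda t: [x - 3 for x in t],
    lambda t: [x / 2 for x in t],
]
outs = gam_forward(toy_blocks, split=2, frames=[[0.0, 1.0], [2.0, 3.0]],
                   condition=1.0, horizon=2)
print(outs[0])  # [1.0, 2.0]
```

The point of the sketch is only the routing: the same list of blocks is reused on both sides of the split, so the predictor is the only added component.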