Geometric Action Model for Robot Policy Learning

Abstract

Generalist robot policies must follow user instructions while reasoning about how objects, cameras, and robot actions interact in the 3D physical world. Recent vision-language-action models (VLAs) and video world-action models (WAMs) inherit strong semantic or temporal priors from large-scale foundation models, but they still operate primarily on 2D image frames or 2D-derived latent spaces, leaving implicit the 3D geometry required for contact-rich manipulation. We propose the Geometric Action Model (GAM), a language-conditioned manipulation policy that directly repurposes a pretrained geometric foundation model (GFM) as a shared substrate for perception, temporal prediction, and action decoding. GAM splits the GFM at an intermediate layer: the shallow layers serve as an observation encoder, and a causal future predictor inserted at the split layer forecasts future latent tokens conditioned on language, proprioception, and action history. The predicted future tokens are then routed through the remaining GFM blocks for feature propagation and decoding, allowing a single backbone to produce both future geometry and actions. This design equips the GFM with language-conditioned temporal world modeling through minimal architectural modification while preserving its rich geometric priors. Across a broad suite of simulation and real-robot manipulation benchmarks, GAM is more accurate, more robust, faster, and lighter than current foundation-model-scale baselines.
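The data flow described above (shallow blocks as encoder, a causal predictor at the split, remaining blocks as decoder) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the names `SplitBackbonePolicy`, `split`, `toy_predictor`, and `toy_action_head` are hypothetical, the scaling blocks stand in for pretrained GFM transformer blocks, the linear extrapolation stands in for a learned causal predictor, the single `condition` vector stands in for language, proprioception, and action-history inputs, and the separate action head is one possible way to read actions off the decoded features.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Tokens = List[List[float]]            # one feature vector per token
Block = Callable[[Tokens], Tokens]    # stand-in for one pretrained GFM block


@dataclass
class SplitBackbonePolicy:
    """Hypothetical wrapper that splits one backbone at an intermediate layer."""
    blocks: Sequence[Block]   # full backbone, ordered shallow to deep
    split: int                # assumed index of the split layer
    predictor: Callable[[List[Tokens], List[float]], Tokens]
    action_head: Callable[[Tokens], List[float]]

    def encode(self, obs: Tokens) -> Tokens:
        # Shallow blocks act as the observation encoder.
        for block in self.blocks[: self.split]:
            obs = block(obs)
        return obs

    def decode(self, latent: Tokens) -> Tokens:
        # Remaining blocks propagate and decode the predicted future tokens.
        for block in self.blocks[self.split:]:
            latent = block(latent)
        return latent

    def step(self, history: List[Tokens], condition: List[float]):
        # Causal: the predictor only sees latents of past and current frames.
        latents = [self.encode(obs) for obs in history]
        future = self.predictor(latents, condition)
        features = self.decode(future)
        return features, self.action_head(features)


def make_block(scale: float) -> Block:
    # Toy block: scales every feature (a real block would be a transformer layer).
    return lambda tokens: [[scale * x for x in tok] for tok in tokens]


def toy_predictor(latents: List[Tokens], condition: List[float]) -> Tokens:
    # Toy forecast: linear extrapolation from the last two latents,
    # shifted per channel by the conditioning vector.
    last = latents[-1]
    prev = latents[-2] if len(latents) > 1 else last
    return [
        [2 * a - b + c for a, b, c in zip(tok_last, tok_prev, condition)]
        for tok_last, tok_prev in zip(last, prev)
    ]


def toy_action_head(features: Tokens) -> List[float]:
    # Toy action readout: per-channel mean over tokens.
    n = len(features)
    return [sum(channel) / n for channel in zip(*features)]


policy = SplitBackbonePolicy(
    blocks=[make_block(2.0)] * 4,
    split=2,
    predictor=toy_predictor,
    action_head=toy_action_head,
)
history = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]  # two past frames
condition = [0.0, 0.0]
features, action = policy.step(history, condition)
print(features, action)
```

The point of the sketch is the routing: the same list of blocks is used on both sides of the predictor, so a single backbone yields the decoded future features and the action, and only the predictor and the readout are new components.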