Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation

Abstract

Generalization in embodied AI is hindered by the "seeing-to-doing gap," which stems from data scarcity and embodiment heterogeneity. To address this, we pioneer "pointing" as a unified, embodiment-agnostic intermediate representation, defining four core embodied pointing abilities that bridge high-level vision-language comprehension with low-level action primitives. We introduce Embodied-R1, a 3B Vision-Language Model (VLM) specifically designed for embodied reasoning and pointing. Drawing on a wide range of embodied and general visual reasoning datasets, we construct a large-scale dataset, Embodied-Points-200K, which supports the core embodied pointing abilities. We then train Embodied-R1 using a two-stage Reinforced Fine-tuning (RFT) curriculum with a specialized multi-task reward design. Embodied-R1 achieves state-of-the-art performance on 11 embodied spatial and pointing benchmarks. Critically, it demonstrates robust zero-shot generalization, achieving a 56.2% success rate in SIMPLEREnv and 87.5% across 8 real-world XArm tasks without any task-specific fine-tuning, a 62% improvement over strong baselines. Furthermore, the model exhibits high robustness against diverse visual disturbances. Our work shows that a pointing-centric representation, combined with an RFT training paradigm, offers an effective and generalizable pathway to closing the perception-action gap in robotics.
