
Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation

August 19, 2025
Authors: Yifu Yuan, Haiqin Cui, Yaoting Huang, Yibin Chen, Fei Ni, Zibin Dong, Pengyi Li, Yan Zheng, Jianye Hao
cs.AI

Abstract

Generalization in embodied AI is hindered by the "seeing-to-doing gap," which stems from data scarcity and embodiment heterogeneity. To address this, we pioneer "pointing" as a unified, embodiment-agnostic intermediate representation, defining four core embodied pointing abilities that bridge high-level vision-language comprehension with low-level action primitives. We introduce Embodied-R1, a 3B Vision-Language Model (VLM) specifically designed for embodied reasoning and pointing. We use a wide range of embodied and general visual reasoning datasets as sources to construct a large-scale dataset, Embodied-Points-200K, which supports key embodied pointing capabilities. We then train Embodied-R1 using a two-stage Reinforced Fine-tuning (RFT) curriculum with a specialized multi-task reward design. Embodied-R1 achieves state-of-the-art performance on 11 embodied spatial and pointing benchmarks. Critically, it demonstrates robust zero-shot generalization by achieving a 56.2% success rate in the SIMPLEREnv and 87.5% across 8 real-world XArm tasks without any task-specific fine-tuning, representing a 62% improvement over strong baselines. Furthermore, the model exhibits high robustness against diverse visual disturbances. Our work shows that a pointing-centric representation, combined with an RFT training paradigm, offers an effective and generalizable pathway to closing the perception-action gap in robotics.
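To make the RFT setup more concrete, below is a minimal illustrative sketch of a verifiable multi-task reward for pointing-style outputs. The abstract does not specify Embodied-R1's actual reward design, output format, or thresholds; the `<point>x, y</point>` tag, the distance threshold, and the weighting are all assumptions introduced here for illustration only.

```python
import re
import math

# Hypothetical sketch: assumes the VLM emits a 2D point as "<point>x, y</point>"
# and that each training sample provides a ground-truth target point. This is NOT
# the paper's reward; it only illustrates the general shape of a multi-task
# (format + point-accuracy) reward used in reinforced fine-tuning.

POINT_PATTERN = re.compile(r"<point>\s*([\d.]+)\s*,\s*([\d.]+)\s*</point>")

def format_reward(response: str) -> float:
    """1.0 if the response contains exactly one well-formed point tag, else 0.0."""
    return 1.0 if len(POINT_PATTERN.findall(response)) == 1 else 0.0

def point_reward(response: str, target_xy: tuple, radius: float) -> float:
    """1.0 if the predicted point lies within `radius` pixels of the target, else 0.0."""
    match = POINT_PATTERN.search(response)
    if match is None:
        return 0.0
    x, y = float(match.group(1)), float(match.group(2))
    return 1.0 if math.hypot(x - target_xy[0], y - target_xy[1]) <= radius else 0.0

def multi_task_reward(response: str, target_xy: tuple,
                      radius: float = 20.0, w_format: float = 0.2) -> float:
    """Weighted combination of format and point-accuracy terms (weights are illustrative)."""
    return (w_format * format_reward(response)
            + (1.0 - w_format) * point_reward(response, target_xy, radius))

if __name__ == "__main__":
    demo = "The handle is at <point>312.5, 188.0</point>."
    print(multi_task_reward(demo, target_xy=(320.0, 190.0)))  # close to target -> high reward
```

Rewards of this verifiable kind can be checked automatically against annotations such as those in Embodied-Points-200K, which is what makes a reinforced fine-tuning curriculum practical at scale.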