Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation
August 19, 2025
Authors: Yifu Yuan, Haiqin Cui, Yaoting Huang, Yibin Chen, Fei Ni, Zibin Dong, Pengyi Li, Yan Zheng, Jianye Hao
cs.AI
Abstract
Generalization in embodied AI is hindered by the "seeing-to-doing gap," which
stems from data scarcity and embodiment heterogeneity. To address this, we
pioneer "pointing" as a unified, embodiment-agnostic intermediate
representation, defining four core embodied pointing abilities that bridge
high-level vision-language comprehension with low-level action primitives. We
introduce Embodied-R1, a 3B Vision-Language Model (VLM) specifically designed
for embodied reasoning and pointing. We use a wide range of embodied and
general visual reasoning datasets as sources to construct a large-scale
dataset, Embodied-Points-200K, which supports key embodied pointing
capabilities. We then train Embodied-R1 using a two-stage Reinforced
Fine-tuning (RFT) curriculum with a specialized multi-task reward design.
Embodied-R1 achieves state-of-the-art performance on 11 embodied spatial and
pointing benchmarks. Critically, it demonstrates robust zero-shot
generalization, achieving a 56.2% success rate in SIMPLEREnv and 87.5%
across 8 real-world XArm tasks without any task-specific fine-tuning,
representing a 62% improvement over strong baselines. Furthermore, the model
exhibits high robustness against diverse visual disturbances. Our work shows
that a pointing-centric representation, combined with an RFT training paradigm,
offers an effective and generalizable pathway to closing the perception-action
gap in robotics.
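
As a rough illustration of the pointing-centric, verifiable rewards the abstract alludes to, the sketch below shows one way a multi-task RFT reward could score a model's textual point predictions against a ground-truth target mask. This is a minimal sketch under stated assumptions, not the authors' implementation: the "(x, y)" output format, the parse_points/pointing_reward helpers, and the 0.1/0.9 format-versus-accuracy weighting are all hypothetical choices for illustration.

```python
# Minimal sketch (not the authors' code): a pointing reward of the kind a
# multi-task RFT setup might use, assuming the VLM emits 2D pixel points as
# "(x, y)" pairs in its text response and ground truth is a binary mask.
import re
import numpy as np

POINT_RE = re.compile(r"\((\d+),\s*(\d+)\)")

def parse_points(response: str) -> list[tuple[int, int]]:
    """Extract all "(x, y)" integer pixel coordinates from the model's text."""
    return [(int(x), int(y)) for x, y in POINT_RE.findall(response)]

def pointing_reward(response: str, gt_mask: np.ndarray) -> float:
    """Format reward (did the model emit any point?) plus accuracy reward
    (fraction of predicted points that land inside the target mask)."""
    points = parse_points(response)
    if not points:
        return 0.0  # no parsable point -> no reward
    h, w = gt_mask.shape
    hits = sum(
        1 for x, y in points
        if 0 <= x < w and 0 <= y < h and gt_mask[y, x]
    )
    return 0.1 + 0.9 * hits / len(points)  # small format bonus + accuracy term

# Example: a 100x100 mask with the target in the upper-left quadrant.
mask = np.zeros((100, 100), dtype=bool)
mask[10:40, 10:40] = True
print(pointing_reward("The handle is at (20, 25).", mask))  # 1.0
print(pointing_reward("I cannot find it.", mask))           # 0.0
```

In a GRPO-style RFT loop, a scalar reward of this form would be computed per sampled response and used to weight policy updates; the exact reward terms and weights used for Embodied-R1 are described in the paper rather than here.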