Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation
August 19, 2025
Authors: Yifu Yuan, Haiqin Cui, Yaoting Huang, Yibin Chen, Fei Ni, Zibin Dong, Pengyi Li, Yan Zheng, Jianye Hao
cs.AI
Abstract
Generalization in embodied AI is hindered by the "seeing-to-doing gap," which
stems from data scarcity and embodiment heterogeneity. To address this, we
pioneer "pointing" as a unified, embodiment-agnostic intermediate
representation, defining four core embodied pointing abilities that bridge
high-level vision-language comprehension with low-level action primitives. We
introduce Embodied-R1, a 3B Vision-Language Model (VLM) specifically designed
for embodied reasoning and pointing. We use a wide range of embodied and
general visual reasoning datasets as sources to construct a large-scale
dataset, Embodied-Points-200K, which supports key embodied pointing
capabilities. We then train Embodied-R1 using a two-stage Reinforced
Fine-tuning (RFT) curriculum with a specialized multi-task reward design.
Embodied-R1 achieves state-of-the-art performance on 11 embodied spatial and
pointing benchmarks. Critically, it demonstrates robust zero-shot
generalization, achieving a 56.2% success rate in SIMPLEREnv and 87.5%
across 8 real-world XArm tasks without any task-specific fine-tuning,
representing a 62% improvement over strong baselines. Furthermore, the model
exhibits high robustness against diverse visual disturbances. Our work shows
that a pointing-centric representation, combined with an RFT training paradigm,
offers an effective and generalizable pathway to closing the perception-action
gap in robotics.
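
As a rough illustration of the pointing-centric, verifiable rewards the abstract alludes to, the sketch below shows one way a multi-task RFT reward could score a model's textual point predictions against a ground-truth target mask. This is a minimal sketch under stated assumptions, not the authors' implementation: the "(x, y)" output format, the parse_points/pointing_reward helpers, and the 0.1/0.9 format-versus-accuracy weighting are all hypothetical choices for illustration.

```python
# Minimal sketch (not the authors' code): a pointing reward of the kind a
# multi-task RFT setup might use, assuming the VLM emits 2D pixel points as
# "(x, y)" pairs in its text response and ground truth is a binary mask.
import re
import numpy as np

POINT_RE = re.compile(r"\((\d+),\s*(\d+)\)")

def parse_points(response: str) -> list[tuple[int, int]]:
    """Extract all "(x, y)" integer pixel coordinates from the model's text."""
    return [(int(x), int(y)) for x, y in POINT_RE.findall(response)]

def pointing_reward(response: str, gt_mask: np.ndarray) -> float:
    """Format reward (did the model emit any point?) plus accuracy reward
    (fraction of predicted points that land inside the target mask)."""
    points = parse_points(response)
    if not points:
        return 0.0  # no parsable point -> no reward
    h, w = gt_mask.shape
    hits = sum(
        1 for x, y in points
        if 0 <= x < w and 0 <= y < h and gt_mask[y, x]
    )
    return 0.1 + 0.9 * hits / len(points)  # small format bonus + accuracy term

# Example: a 100x100 mask with the target in the upper-left quadrant.
mask = np.zeros((100, 100), dtype=bool)
mask[10:40, 10:40] = True
print(pointing_reward("The handle is at (20, 25).", mask))  # 1.0
print(pointing_reward("I cannot find it.", mask))           # 0.0
```

In a GRPO-style RFT loop, a scalar reward of this form would be computed per sampled response and used to weight policy updates; the exact reward terms and weights used for Embodied-R1 are described in the paper rather than here.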