Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation

Abstract

Generalization in embodied AI is hindered by the "seeing-to-doing gap," which stems from data scarcity and embodiment heterogeneity. To address this, we pioneer "pointing" as a unified, embodiment-agnostic intermediate representation and define four core embodied pointing abilities that bridge high-level vision-language comprehension with low-level action primitives. We introduce Embodied-R1, a 3B Vision-Language Model (VLM) designed specifically for embodied reasoning and pointing. We construct a large-scale dataset, Embodied-Points-200K, from a wide range of embodied and general visual reasoning datasets to support key embodied pointing capabilities. We then train Embodied-R1 with a two-stage Reinforced Fine-tuning (RFT) curriculum and a specialized multi-task reward design. Embodied-R1 achieves state-of-the-art performance on 11 embodied spatial and pointing benchmarks. Critically, it demonstrates robust zero-shot generalization, achieving a 56.2% success rate in SIMPLEREnv and 87.5% across 8 real-world XArm tasks without any task-specific fine-tuning, a 62% improvement over strong baselines. The model is also highly robust to diverse visual disturbances. Our work shows that a pointing-centric representation, combined with an RFT training paradigm, offers an effective and generalizable pathway to closing the perception-action gap in robotics.
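To make the idea of a point as an embodiment-agnostic intermediate representation concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a predicted 2D image point is paired with a depth reading and back-projected through a standard pinhole camera model to obtain a 3D target that any downstream controller could consume. The names `Intrinsics` and `deproject`, and all numeric values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera intrinsics in pixels (hypothetical values are used below)."""
    fx: float
    fy: float
    cx: float
    cy: float


def deproject(u: float, v: float, depth_m: float, k: Intrinsics) -> tuple:
    """Back-project pixel (u, v) at depth_m metres into the camera frame."""
    x = (u - k.cx) / k.fx * depth_m
    y = (v - k.cy) / k.fy * depth_m
    return (x, y, depth_m)


# Assumed example: a point predicted at pixel (380, 240) with 0.5 m measured depth.
k = Intrinsics(fx=600.0, fy=600.0, cx=320.0, cy=240.0)
target = deproject(380.0, 240.0, 0.5, k)
```

Under these assumptions the point itself carries no robot-specific information. Only the final step, which turns the 3D target into motion, would depend on the embodiment.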
