

Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics

May 29, 2025
Authors: Dongyoung Kim, Sumin Park, Huiwon Jang, Jinwoo Shin, Jaehyung Kim, Younggyo Seo
cs.AI

Abstract

Large Vision-Language Models (LVLMs) have recently shown great promise in advancing robotics by combining embodied reasoning with robot control. A common approach involves training on embodied reasoning tasks related to robot control using Supervised Fine-Tuning (SFT). However, SFT datasets are often heuristically constructed and not explicitly optimized for improving robot control. Furthermore, SFT often leads to issues such as catastrophic forgetting and reduced generalization performance. To address these limitations, we introduce Robot-R1, a novel framework that leverages reinforcement learning to enhance embodied reasoning specifically for robot control. Robot-R1 learns to predict the next keypoint state required for task completion, conditioned on the current scene image and environment metadata derived from expert demonstrations. Inspired by the DeepSeek-R1 learning approach, Robot-R1 samples reasoning-based responses and reinforces those that lead to more accurate predictions. Our experiments show that models trained with Robot-R1 outperform SFT methods on embodied reasoning tasks. Despite having only 7B parameters, Robot-R1 even surpasses GPT-4o on reasoning tasks related to low-level action control, such as spatial and primitive movement reasoning.
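The sampling-and-reinforcement loop the abstract describes (sample several reasoning-based responses, reinforce those whose keypoint predictions are more accurate) can be sketched roughly as follows. This is an illustrative toy, not the paper's implementation: the policy stub, the negative-distance reward, and the GRPO-style group-normalized advantages are all assumptions standing in for the LVLM and the paper's actual reward design.

```python
import math
import random

def keypoint_reward(predicted, target):
    """Illustrative reward: negative Euclidean distance between the
    predicted next keypoint and the ground truth from the expert
    demonstration. The paper's exact reward shaping may differ."""
    return -math.dist(predicted, target)

def sample_responses(policy, observation, n=4):
    """Stand-in for sampling n reasoning-based responses from an LVLM.
    Here each 'response' is just the policy's keypoint guess plus noise."""
    base = policy(observation)
    return [tuple(x + random.gauss(0, 0.1) for x in base) for _ in range(n)]

def group_advantages(rewards):
    """GRPO-style normalization: each sampled response's advantage is
    its reward standardized within the sampled group, so responses
    better than the group average get positive advantage."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Toy rollout: the policy guesses a keypoint; the expert demonstration
# says the true next keypoint state is (1.0, 0.5, 0.2).
random.seed(0)
policy = lambda obs: (0.8, 0.4, 0.1)
target = (1.0, 0.5, 0.2)

responses = sample_responses(policy, observation=None, n=4)
rewards = [keypoint_reward(r, target) for r in responses]
advantages = group_advantages(rewards)

# Responses with positive advantage would be reinforced (up-weighted
# in the policy-gradient update); negative ones down-weighted.
best_response, best_adv = max(zip(responses, advantages), key=lambda p: p[1])
```

In the actual framework the responses would be full chain-of-thought rollouts from the 7B LVLM conditioned on the scene image and environment metadata, with the keypoint prediction parsed from the response text before scoring.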

