Language to Rewards for Robotic Skill Synthesis
June 14, 2023
Authors: Wenhao Yu, Nimrod Gileadi, Chuyuan Fu, Sean Kirmani, Kuang-Huei Lee, Montse Gonzalez Arenas, Hao-Tien Lewis Chiang, Tom Erez, Leonard Hasenclever, Jan Humplik, Brian Ichter, Ted Xiao, Peng Xu, Andy Zeng, Tingnan Zhang, Nicolas Heess, Dorsa Sadigh, Jie Tan, Yuval Tassa, Fei Xia
cs.AI
Abstract
Large language models (LLMs) have demonstrated exciting progress in acquiring
diverse new capabilities through in-context learning, ranging from logical
reasoning to code-writing. Robotics researchers have also explored using LLMs
to advance the capabilities of robotic control. However, since low-level robot
actions are hardware-dependent and underrepresented in LLM training corpora,
existing efforts in applying LLMs to robotics have largely treated LLMs as
semantic planners or relied on human-engineered control primitives to interface
with the robot. On the other hand, reward functions have been shown to be flexible
representations that can be optimized for control policies to achieve diverse
tasks, while their semantic richness makes them suitable to be specified by
LLMs. In this work, we introduce a new paradigm that harnesses this realization
by utilizing LLMs to define reward parameters that can be optimized to
accomplish a variety of robotic tasks. Using reward as the intermediate interface
generated by LLMs, we can effectively bridge the gap between high-level
language instructions or corrections and low-level robot actions. Meanwhile,
combining this with a real-time optimizer, MuJoCo MPC, empowers an interactive
behavior creation experience where users can immediately observe the results
and provide feedback to the system. To systematically evaluate the performance
of our proposed method, we designed a total of 17 tasks for a simulated
quadruped robot and a dexterous manipulator robot. We demonstrate that our
proposed method reliably tackles 90% of the designed tasks, while a baseline
using primitive skills as the interface with Code-as-Policies achieves 50% of
the tasks. We further validated our method on a real robot arm where complex
manipulation skills such as non-prehensile pushing emerge through our
interactive system.
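
To make the reward-as-interface idea concrete, below is a minimal, self-contained sketch, not the authors' implementation. The names `query_llm`, `rollout_cost`, and `plan_with_mpc`, the 1-D point-mass dynamics, and the specific reward terms are all assumptions for illustration: an LLM stub turns a language instruction into reward parameters, and a naive random-shooting planner stands in for the real-time MuJoCo MPC optimizer used in the paper.

```python
"""Hypothetical sketch of the language-to-reward loop (not the paper's code)."""
import numpy as np


def query_llm(instruction: str) -> dict:
    """Stub: map a language instruction to reward parameters.

    A real system would prompt an LLM to fill in the target and term weights.
    """
    if "move to" in instruction:
        return {"target_pos": float(instruction.split()[-1]),
                "w_pos": 1.0, "w_effort": 0.01}
    return {"target_pos": 0.0, "w_pos": 1.0, "w_effort": 0.01}


def rollout_cost(x, v, actions, spec, dt=0.1):
    """Accumulated cost of an action sequence on a toy 1-D point mass."""
    cost = 0.0
    for a in actions:
        v += a * dt
        x += v * dt
        cost += spec["w_pos"] * (x - spec["target_pos"]) ** 2
        cost += spec["w_effort"] * a ** 2
    return cost


def plan_with_mpc(x, v, spec, horizon=20, samples=256, seed=0):
    """Random-shooting planner standing in for MuJoCo MPC:
    sample action sequences, keep the cheapest, apply only its first action."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(-1.0, 1.0, size=(samples, horizon))
    costs = [rollout_cost(x, v, seq, spec) for seq in candidates]
    return candidates[int(np.argmin(costs))][0]


if __name__ == "__main__":
    spec = query_llm("move to 2.0")   # language instruction -> reward parameters
    x, v, dt = 0.0, 0.0, 0.1
    for _ in range(60):               # closed-loop replanning at every step
        a = plan_with_mpc(x, v, spec)
        v += a * dt
        x += v * dt
    print(f"final position {x:.2f}, target {spec['target_pos']}")
```

The point of the sketch is the division of labor the abstract describes: language only ever touches the reward parameters, while a generic optimizer turns those parameters into low-level actions, so user corrections can be absorbed by editing the reward rather than the controller.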