Learning H-Infinity Locomotion Control

April 22, 2024
Authors: Junfeng Long, Wenye Yu, Quanyi Li, Zirui Wang, Dahua Lin, Jiangmiao Pang
cs.AI

Abstract

Stable locomotion in precipitous environments is an essential capability of quadruped robots, demanding the ability to resist various external disturbances. However, recent learning-based policies use only basic domain randomization to improve robustness, which cannot guarantee that the robot has adequate disturbance-resistance capabilities. In this paper, we propose to model the learning process as an adversarial interaction between the actor and a newly introduced disturber, and to ensure their optimization with an H∞ constraint. In contrast to the actor, which maximizes the discounted overall reward, the disturber is responsible for generating effective external forces and is optimized by maximizing the error between the task reward and its oracle value, i.e., the "cost", in each iteration. To keep the joint optimization between the actor and the disturber stable, our H∞ constraint bounds the ratio of the cost to the intensity of the external forces. Through this reciprocal interaction throughout the training phase, the actor acquires the capability to withstand increasingly complex physical disturbances. We verify the robustness of our approach on quadrupedal locomotion tasks with the Unitree Aliengo robot, and also on a more challenging task with the Unitree A1 robot, where the quadruped is expected to perform locomotion merely on its hind legs, as if it were a bipedal robot. The quantitative results in simulation show improvements over baselines, demonstrating the effectiveness of the method and of each design choice. Real-robot experiments, in turn, qualitatively exhibit how robust the policy remains under various disturbances on diverse terrains, including stairs, high platforms, slopes, and slippery surfaces. All code, checkpoints, and real-world deployment guidance will be made public.
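
The adversarial setup described above can be read as a two-player loop: the disturber searches for forces that maximize the cost (the gap between the task reward and its oracle), while the H∞ constraint keeps that cost bounded by a multiple η of the force intensity, roughly cost ≤ η·‖f‖². Below is a minimal, self-contained PyTorch sketch of one way such a loop could be structured; the network sizes, the toy differentiable rollout, the penalty-based constraint handling, and names such as `rollout`, `eta`, and `penalty_coef` are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the adversarial actor-disturber loop with an
# H-infinity-style constraint. The toy "rollout" stands in for physics
# simulation, and the constraint is enforced with a simple penalty term;
# neither reflects the paper's actual implementation.

obs_dim, act_dim, force_dim = 48, 12, 3
eta = 1.0            # assumed bound on cost / ||force||^2
penalty_coef = 10.0  # assumed weight on constraint violations

actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
disturber = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, force_dim))
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
dist_opt = torch.optim.Adam(disturber.parameters(), lr=3e-4)

def rollout(action, force):
    """Toy stand-in for simulation: the disturbance degrades tracking of
    the first force_dim action dimensions; the oracle is the reward the
    actor would earn with zero tracking error."""
    task_reward = -((action[:, :force_dim] - force).pow(2).mean(dim=-1))
    oracle_reward = torch.zeros_like(task_reward)
    return task_reward, oracle_reward

for it in range(200):
    obs = torch.randn(256, obs_dim)  # placeholder observation batch

    # Disturber step: maximize the cost (oracle minus achieved reward),
    # penalizing violations of cost <= eta * ||force||^2 so the damage a
    # disturbance causes stays proportionate to its intensity.
    force = disturber(obs)
    task_reward, oracle = rollout(actor(obs).detach(), force)
    cost = oracle - task_reward
    intensity = force.pow(2).sum(dim=-1)
    violation = torch.relu(cost - eta * intensity)
    dist_loss = -cost.mean() + penalty_coef * violation.mean()
    dist_opt.zero_grad()
    dist_loss.backward()
    dist_opt.step()

    # Actor step: maximize the task reward under the current disturber.
    force = disturber(obs).detach()
    task_reward, _ = rollout(actor(obs), force)
    actor_loss = -task_reward.mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```

In a full pipeline, the placeholder rollout would be replaced by simulated robot rollouts and the naive losses by a standard policy-gradient surrogate (e.g., PPO); the sketch only shows how the cost, the force intensity, and the H∞ bound interact across the two alternating updates.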
