Learning H-Infinity Locomotion Control
April 22, 2024
Authors: Junfeng Long, Wenye Yu, Quanyi Li, Zirui Wang, Dahua Lin, Jiangmiao Pang
cs.AI
Abstract
Stable locomotion in precipitous environments is an essential capability of
quadruped robots, demanding the ability to resist various external
disturbances. However, recent learning-based policies only use basic domain
randomization to improve the robustness of learned policies, which cannot
guarantee that the robot has adequate disturbance resistance capabilities. In
In this paper, we propose to model the learning process as an adversarial interaction between the actor and a newly introduced disturber, and we ensure their optimization with an H∞ constraint. In contrast to the actor, which maximizes the discounted overall reward, the disturber is responsible for generating effective external forces and is optimized by maximizing the error between the task reward and its oracle, i.e., the "cost", in each iteration. To keep the joint optimization between the actor and the disturber stable, our H∞ constraint bounds the ratio of the cost to the intensity of the external forces. Through reciprocal interaction throughout the training phase, the actor acquires the capability to handle increasingly complex physical disturbances. We verify the robustness of our approach on quadrupedal locomotion tasks with the Unitree Aliengo robot, and also on a more challenging task with the Unitree A1 robot, where the quadruped is expected to perform locomotion merely on its hind legs as if it were a bipedal robot. Quantitative results in simulation show improvements over baselines, demonstrating the effectiveness of the method and of each design choice. Real-robot experiments qualitatively exhibit how robust the policy is when subjected to various disturbances on various terrains, including stairs, high platforms, slopes, and slippery surfaces. All code, checkpoints, and real-world deployment guidance will be made public.
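To make the adversarial setup concrete, below is a minimal sketch of the actor/disturber update described in the abstract. It is not the authors' implementation: the REINFORCE-style surrogate losses, network sizes, the oracle reward, and the bound `ETA` are illustrative assumptions, and the full on-policy pipeline (value functions, advantage estimation, PPO clipping, etc.) is omitted.

```python
# Illustrative sketch (not the authors' code) of the adversarial actor/disturber
# update with an H-infinity-style constraint, using simple Gaussian policies.
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, FORCE_DIM = 48, 12, 3   # assumed dimensions
ETA = 1.0                                 # assumed bound on cost / ||force||^2

actor = nn.Sequential(nn.Linear(OBS_DIM, 128), nn.ELU(), nn.Linear(128, ACT_DIM))
disturber = nn.Sequential(nn.Linear(OBS_DIM, 128), nn.ELU(), nn.Linear(128, FORCE_DIM))
opt_a = torch.optim.Adam(actor.parameters(), lr=3e-4)
opt_d = torch.optim.Adam(disturber.parameters(), lr=3e-4)

def joint_update(obs, actions, forces, returns, task_reward, oracle_reward):
    """One update on a batch of transitions collected with the current policies.

    obs:           (B, OBS_DIM)   observations
    actions:       (B, ACT_DIM)   actions sampled from the actor
    forces:        (B, FORCE_DIM) external forces sampled from the disturber
    returns:       (B,)           discounted overall returns (actor's objective)
    task_reward:   (B,)           task reward actually achieved
    oracle_reward: (B,)           preset oracle value of the task reward
    """
    # "Cost" = error between the task reward and its oracle value.
    cost = oracle_reward - task_reward

    # Actor: maximize the discounted overall return.
    a_dist = torch.distributions.Normal(actor(obs), 1.0)
    actor_loss = -(a_dist.log_prob(actions).sum(-1) * returns).mean()
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()

    # Disturber: maximize the cost, while an H-infinity-style constraint keeps
    # the ratio cost / ||force||^2 below ETA (enforced here as a soft penalty).
    d_dist = torch.distributions.Normal(disturber(obs), 1.0)
    intensity = forces.pow(2).sum(-1)
    violation = torch.relu(cost - ETA * intensity)   # amount by which the bound is exceeded
    disturber_loss = -(d_dist.log_prob(forces).sum(-1) * (cost - violation)).mean()
    opt_d.zero_grad(); disturber_loss.backward(); opt_d.step()
    return actor_loss.item(), disturber_loss.item()
```

In this sketch the constraint is enforced as a soft penalty: whenever the cost attributed to the disturber exceeds ETA times the squared force intensity, the excess is subtracted from the disturber's learning signal, capping the cost-to-intensity ratio it is rewarded for.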