Learning H-Infinity Locomotion Control
April 22, 2024
Authors: Junfeng Long, Wenye Yu, Quanyi Li, Zirui Wang, Dahua Lin, Jiangmiao Pang
cs.AI
Abstract
Stable locomotion in precipitous environments is an essential capability of
quadruped robots, demanding the ability to resist various external
disturbances. However, recent learning-based methods rely only on basic domain
randomization to improve the robustness of the learned policies, which cannot
guarantee that the robot has adequate resistance to disturbances. In this
paper, we propose to model the learning process as an adversarial interaction
between the actor and a newly introduced disturber, and to ensure their
optimization under an H∞ constraint. In contrast to the actor, which maximizes
the discounted overall reward, the disturber is responsible for generating
effective external forces and is optimized by maximizing the error between the
task reward and its oracle, i.e., the "cost", in each iteration. To keep the
joint optimization between the actor and the disturber stable, our H∞
constraint bounds the ratio between the cost and the intensity of the external
forces. Through reciprocal interaction throughout the training phase, the
actor acquires the capability to withstand increasingly complex physical
disturbances. We verify the robustness of our approach on quadrupedal
locomotion tasks with the Unitree Aliengo robot, and on a more challenging
task with the Unitree A1 robot, in which the quadruped is expected to locomote
merely on its hind legs, as if it were a bipedal robot. Quantitative results
in simulation show improvements over baselines, demonstrating the
effectiveness of the method and of each design choice. Meanwhile, real-robot
experiments qualitatively demonstrate how robust the policy remains under
various disturbances on various terrains, including stairs, high platforms,
slopes, and slippery surfaces. All code, checkpoints, and real-world
deployment guidance will be made public.
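
As a rough formalization of the constraint described above (the notation is ours, not taken from the paper, which may define the cost and the force intensity differently):

```latex
% Rough rendering of the abstract's H-infinity constraint (our notation).
% r_t: task reward at iteration t; \bar{r}_t: its oracle (disturbance-free)
% value; c_t: cost; d_t: external force from the disturber;
% \eta: prescribed bound on the cost-to-intensity ratio.
\[
  c_t \;=\; \bar{r}_t - r_t,
  \qquad
  \frac{c_t}{\lVert d_t \rVert} \;\le\; \eta
  \quad \text{for all iterations } t.
\]
```

The classical H∞ reading is that the gain from disturbance intensity to performance degradation stays below a prescribed level, which is what keeps the disturber from overpowering the actor during joint training.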
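A minimal sketch of the adversarial loop as we read it from the abstract: the actor ascends the task reward, the disturber ascends the cost, and disturbances that violate the ratio bound are scaled back. Everything here (the one-step dynamics, the rewards, the finite-difference updates, the constant `eta`, and the projection-by-shrinking step) is an illustrative assumption, not the authors' implementation.

```python
# Toy sketch of the actor-disturber game described in the abstract.
# All dynamics, rewards, and hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
eta = 2.0                    # assumed H-infinity level: cost / ||d|| <= eta
theta = rng.normal(size=2)   # toy actor parameters
phi = rng.normal(size=2)     # toy disturber parameters

def rollout(theta, phi):
    """One-step toy rollout; returns (task reward, cost, external force)."""
    state = np.array([1.0, -1.0])
    action = theta * state                  # toy linear policy
    force = phi * state                     # toy linear disturber
    nominal = 0.5 * state + action          # next state without disturbance
    disturbed = nominal + 0.1 * force       # next state with disturbance
    reward = -float(disturbed @ disturbed)  # task reward: stay near origin
    oracle = -float(nominal @ nominal)      # oracle reward (no disturbance)
    return reward, oracle - reward, force   # cost = oracle - task reward

def fd_grad(f, x, h=1e-4):
    """Central finite-difference gradient of a scalar function (toy optimizer)."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return g

for _ in range(200):
    # Disturber step: gradient ascent on the cost.
    phi += 0.05 * fd_grad(lambda p: rollout(theta, p)[1], phi)

    # Crude H-infinity projection: shrink forces that violate the ratio bound.
    for _ in range(50):
        _, cost, force = rollout(theta, phi)
        if cost <= eta * (float(np.linalg.norm(force)) + 1e-8):
            break
        phi *= 0.9

    # Actor step: gradient ascent on the task reward under the disturbance.
    theta += 0.05 * fd_grad(lambda t: rollout(t, phi)[0], theta)
```

In the paper's actual setting these updates would presumably be policy-gradient steps on neural networks inside a physics simulator; the sketch only mirrors the optimization structure of alternating actor and disturber updates under the ratio bound.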