Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty

Abstract

Reinforcement learning provides a framework for learning control policies that reproduce diverse motions for simulated characters. However, such policies often exploit unnatural high-frequency signals that neither humans nor physical robots can produce, which makes them poor representations of real-world behavior. Existing work addresses this by adding a reward term that penalizes large changes in action over time, but this term often requires substantial tuning. We propose the action Jacobian penalty, which directly penalizes changes in action with respect to changes in the simulated state, computed through automatic differentiation. This effectively eliminates unrealistic high-frequency control signals without task-specific tuning. While effective, the action Jacobian penalty introduces significant computational overhead when used with traditional fully connected neural network architectures. To mitigate this, we introduce a new architecture, the Linear Policy Net (LPN), that significantly reduces the computational cost of calculating the action Jacobian penalty during training. In addition, an LPN requires no parameter tuning, converges faster than baseline methods, and can be queried more efficiently at inference time than a fully connected neural network. We demonstrate that a Linear Policy Net, combined with the action Jacobian penalty, learns policies that generate smooth signals while solving a number of motion imitation tasks with different characteristics, including dynamic motions such as a backflip and various challenging parkour skills. Finally, we apply this approach to create policies for dynamic motions on a physical quadrupedal robot equipped with an arm.
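To make the idea of penalizing the action Jacobian concrete, the following is a minimal sketch and not the authors' implementation. It assumes the penalty is the squared Frobenius norm of the Jacobian of the action with respect to the state, which the abstract does not specify. It uses a tiny forward-mode automatic differentiation based on dual numbers in place of a deep learning framework. All names (`Dual`, `mlp_policy`, `linear_policy`, `action_jacobian`, `jacobian_penalty`) are hypothetical and chosen only for illustration.

```python
import math


class Dual:
    """Number val + eps * dot; carrying dot along gives forward-mode autodiff."""

    def __init__(self, val, dot=0.0):
        self.val = val
        self.dot = dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (a * b)' = a * b' + a' * b
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)

    __rmul__ = __mul__


def dual_tanh(x):
    """tanh on a Dual, with derivative 1 - tanh(x)^2."""
    t = math.tanh(x.val)
    return Dual(t, (1.0 - t * t) * x.dot)


def mlp_policy(state, W1, W2):
    """Hypothetical fully connected policy: action = W2 @ tanh(W1 @ state)."""
    hidden = [dual_tanh(sum(w * s for w, s in zip(row, state))) for row in W1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]


def linear_policy(state, K, k):
    """Hypothetical linear policy: action = K @ state + k."""
    return [sum(w * s for w, s in zip(row, state)) + b for row, b in zip(K, k)]


def action_jacobian(policy, state):
    """Return J with J[i][j] = d action_i / d state_j.

    Forward mode needs one policy evaluation per state dimension.
    """
    columns = []
    for j in range(len(state)):
        seeded = [Dual(s, 1.0 if i == j else 0.0) for i, s in enumerate(state)]
        columns.append([a.dot for a in policy(seeded)])
    # columns[j][i] holds d action_i / d state_j; transpose into rows.
    return [list(row) for row in zip(*columns)]


def jacobian_penalty(policy, state, weight=1.0):
    """Assumed penalty form: weight * squared Frobenius norm of the Jacobian."""
    J = action_jacobian(policy, state)
    return weight * sum(v * v for row in J for v in row)


# Example: for a linear policy the Jacobian is the gain matrix K itself.
K = [[1.0, -2.0], [0.5, 0.0]]
k = [0.1, 0.2]
penalty = jacobian_penalty(lambda s: linear_policy(s, K, k), [0.3, -0.7])
```

For the fully connected policy, the Jacobian depends on the state and must be recomputed by differentiation at every sample. For a policy that is linear in the state, the Jacobian is simply the gain matrix, so the penalty reduces to the squared Frobenius norm of `K`. This is one plausible reason a linear parameterization could make the penalty cheaper to evaluate; the abstract does not state how the LPN achieves its savings.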
