Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty

Abstract

Reinforcement learning provides a framework for learning control policies that can reproduce diverse motions for simulated characters. However, such policies often exploit unnatural high-frequency signals that humans and physical robots cannot produce, making them poor representations of real-world behaviors. Existing work addresses this issue by adding a reward term that penalizes large changes in actions over time, but this term often requires substantial tuning effort. We propose the action Jacobian penalty, which uses automatic differentiation to directly penalize changes in action with respect to changes in simulated state. This effectively eliminates unrealistic high-frequency control signals without task-specific tuning. While effective, the action Jacobian penalty introduces significant computational overhead when used with traditional fully connected neural network architectures. To mitigate this, we introduce a new architecture called a Linear Policy Net (LPN) that significantly reduces the computational burden of calculating the action Jacobian penalty during training. In addition, an LPN requires no parameter tuning, converges faster than baseline methods, and can be queried more efficiently at inference time than a fully connected neural network. We demonstrate that a Linear Policy Net, combined with the action Jacobian penalty, learns policies that generate smooth signals while solving a number of motion imitation tasks with different characteristics, including dynamic motions such as a backflip and various challenging parkour skills. Finally, we apply this approach to create policies for dynamic motions on a physical quadrupedal robot equipped with an arm.
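To make the idea of penalizing the action Jacobian concrete, the following is a minimal JAX sketch, not the authors' implementation. The abstract does not give the exact form of the penalty or the LPN architecture, so several things here are assumptions: the penalty is taken to be the squared Frobenius norm of the Jacobian of the action with respect to the state; the network sizes and the names `init_mlp`, `mlp_policy`, `jacobian_penalty`, `linear_policy`, and `linear_jacobian_penalty` are hypothetical; and the linear policy is assumed, based only on the phrase "time-varying linear policies" in the title, to be affine in the state with a gain matrix that depends on time.

```python
import jax
import jax.numpy as jnp

# Hypothetical sizes, chosen only for illustration.
STATE_DIM, ACTION_DIM, HIDDEN = 6, 3, 16


def init_mlp(key):
    """Random parameters for a small fully connected policy."""
    k1, k2 = jax.random.split(key)
    return {
        "W1": 0.1 * jax.random.normal(k1, (HIDDEN, STATE_DIM)),
        "b1": jnp.zeros(HIDDEN),
        "W2": 0.1 * jax.random.normal(k2, (ACTION_DIM, HIDDEN)),
        "b2": jnp.zeros(ACTION_DIM),
    }


def mlp_policy(params, state):
    """Fully connected policy: maps a state vector to an action vector."""
    hidden = jnp.tanh(params["W1"] @ state + params["b1"])
    return params["W2"] @ hidden + params["b2"]


def jacobian_penalty(params, state):
    """Assumed penalty: squared Frobenius norm of d(action)/d(state)."""
    # Automatic differentiation with respect to the state (argument 1).
    jac = jax.jacobian(mlp_policy, argnums=1)(params, state)
    return jnp.sum(jac ** 2)


def linear_policy(gain, bias, state):
    """Assumed time-varying linear policy at one time step: a = K_t s + b_t."""
    return gain @ state + bias


def linear_jacobian_penalty(gain):
    """For an affine policy the Jacobian is the gain matrix itself."""
    return jnp.sum(gain ** 2)


# Mean penalty over a batch of states, with the parameters shared across the batch.
batched_penalty = jax.vmap(jacobian_penalty, in_axes=(None, 0))
```

Under these assumptions, the fully connected policy needs a Jacobian computation per state (and a second differentiation pass when the penalty is minimized with respect to the parameters, for example via `jax.grad(jacobian_penalty)`), whereas the affine policy reads the Jacobian directly from its gain matrix. This is one plausible reading of why a linear policy would reduce the overhead described above.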

