Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty
February 20, 2026
Authors: Zhaoming Xie, Kevin Karol, Jessica Hodgins
cs.AI
Abstract
Reinforcement learning provides a framework for learning control policies that can reproduce diverse motions for simulated characters. However, such policies often exploit unnatural high-frequency signals that are unachievable by humans or physical robots, making them poor representations of real-world behaviors. Existing work addresses this issue by adding a reward term that penalizes large changes in actions over time, but this term often requires substantial tuning effort. We propose to use an action Jacobian penalty, which directly penalizes changes in action with respect to changes in the simulated state through automatic differentiation. This effectively eliminates unrealistic high-frequency control signals without task-specific tuning. While effective, the action Jacobian penalty introduces significant computational overhead when used with traditional fully connected neural network architectures. To mitigate this, we introduce a new architecture called a Linear Policy Net (LPN) that significantly reduces the computational burden of calculating the action Jacobian penalty during training. In addition, an LPN requires no parameter tuning, exhibits faster learning convergence compared to baseline methods, and can be queried more efficiently at inference time than a fully connected neural network. We demonstrate that a Linear Policy Net, combined with the action Jacobian penalty, learns policies that generate smooth control signals while solving a number of motion imitation tasks with different characteristics, including dynamic motions such as a backflip and various challenging parkour skills. Finally, we apply this approach to create policies for dynamic motions on a physical quadrupedal robot equipped with an arm.
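To make the core idea concrete, here is a minimal sketch (not the paper's code; all names are illustrative) of why a time-varying linear policy makes the action Jacobian penalty cheap: for a policy a = K_t s + k_t, the Jacobian da/ds is simply the gain matrix K_t, so the penalty ||da/ds||_F^2 can be read off directly with no backpropagation through a network. The finite-difference check at the end confirms the analytic Jacobian.

```python
import numpy as np

# Illustrative sketch of a time-varying linear policy at one timestep t:
#   a_t = K_t @ s + k_t
# Its action Jacobian da/ds is constant in s and equals K_t, so the
# penalty ||da/ds||_F^2 costs only a sum of squares over K_t.
rng = np.random.default_rng(0)

state_dim, action_dim = 4, 2
K_t = rng.normal(size=(action_dim, state_dim))  # per-timestep gain matrix
k_t = rng.normal(size=action_dim)               # per-timestep bias

def act(s):
    """Query the linear policy at state s."""
    return K_t @ s + k_t

def action_jacobian_penalty():
    """Squared Frobenius norm of da/ds; for a linear policy this is ||K_t||_F^2."""
    return float(np.sum(K_t ** 2))

# Sanity check: central finite differences of act() recover K_t.
s = rng.normal(size=state_dim)
eps = 1e-6
jac_fd = np.stack(
    [(act(s + eps * e) - act(s - eps * e)) / (2 * eps) for e in np.eye(state_dim)],
    axis=1,
)
assert np.allclose(jac_fd, K_t, atol=1e-5)

penalty = action_jacobian_penalty()
```

With a fully connected network, computing this same quantity would require a full Jacobian pass through the network via automatic differentiation at every training step, which is the overhead the LPN architecture is designed to avoid.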