Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty
February 20, 2026
Authors: Zhaoming Xie, Kevin Karol, Jessica Hodgins
cs.AI
Abstract
Reinforcement learning provides a framework for learning control policies that can reproduce diverse motions for simulated characters. However, such policies often exploit unnatural high-frequency signals that humans or physical robots cannot produce, making them poor representations of real-world behaviors. Existing work addresses this issue by adding a reward term that penalizes large changes in actions over time, but this term often requires substantial tuning effort. We propose the action Jacobian penalty, which directly penalizes changes in action with respect to changes in the simulated state via automatic differentiation. This effectively eliminates unrealistic high-frequency control signals without task-specific tuning. While effective, the action Jacobian penalty introduces significant computational overhead when used with traditional fully connected neural network architectures. To mitigate this, we introduce a new architecture, the Linear Policy Net (LPN), that significantly reduces the computational burden of calculating the action Jacobian penalty during training. In addition, an LPN requires no parameter tuning, converges faster than baseline methods, and can be queried more efficiently at inference time than a fully connected neural network. We demonstrate that a Linear Policy Net, combined with the action Jacobian penalty, learns policies that generate smooth signals while solving a range of motion imitation tasks with different characteristics, including dynamic motions such as a backflip and various challenging parkour skills. Finally, we apply this approach to create policies for dynamic motions on a physical quadrupedal robot equipped with an arm.
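The core idea of an action Jacobian penalty can be sketched with automatic differentiation: differentiate the policy's action output with respect to its state input and penalize the size of that Jacobian. The sketch below is a minimal illustration in JAX, not the paper's implementation; the two-layer MLP, the Frobenius-norm form of the penalty, and the time-varying linear parameterization a = K(t)·s + b(t) (under which the action Jacobian is simply K(t), suggesting why a linear policy makes the penalty cheap) are all assumptions for illustration.

```python
# Hedged sketch of an action Jacobian penalty, assuming a Frobenius-norm
# form of the penalty. Not the authors' implementation.
import jax
import jax.numpy as jnp


def mlp_policy(params, state):
    # Tiny fully connected policy: two layers with tanh activations.
    h = jnp.tanh(params["W1"] @ state + params["b1"])
    return jnp.tanh(params["W2"] @ h + params["b2"])


def jacobian_penalty(params, state):
    # Squared Frobenius norm of d(action)/d(state), via autodiff.
    # For an MLP this requires a full Jacobian evaluation per state.
    J = jax.jacobian(mlp_policy, argnums=1)(params, state)
    return jnp.sum(J ** 2)


def linear_policy(K_t, b_t, state):
    # Hypothetical time-varying linear policy: a = K(t) s + b(t).
    return K_t @ state + b_t


def linear_policy_penalty(K_t):
    # For a linear policy the action Jacobian is K(t) itself,
    # so no autodiff pass is needed to penalize it.
    return jnp.sum(K_t ** 2)
```

For the linear policy, `jax.jacobian` would recover exactly `K_t`, which is why the penalty reduces to a direct norm on the gain matrix.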