TGPO: Temporal Grounded Policy Optimization for Signal Temporal Logic Tasks
September 30, 2025
Authors: Yue Meng, Fei Chen, Chuchu Fan
cs.AI
Abstract
Learning control policies for complex, long-horizon tasks is a central
challenge in robotics and autonomous systems. Signal Temporal Logic (STL)
offers a powerful and expressive language for specifying such tasks, but its
non-Markovian nature and inherently sparse rewards make it difficult to solve
with standard Reinforcement Learning (RL) algorithms. Prior RL approaches focus
only on limited STL fragments or use STL robustness scores as sparse terminal
rewards. In this paper, we propose TGPO, Temporal Grounded Policy Optimization,
to solve general STL tasks. TGPO decomposes STL into timed subgoals and
invariant constraints and provides a hierarchical framework to tackle the
problem. The high-level component of TGPO proposes concrete time allocations
for these subgoals, and the low-level time-conditioned policy learns to achieve
the sequenced subgoals using a dense, stage-wise reward signal. During
inference, we sample various time allocations and select the most promising
assignment for the policy network to roll out the solution trajectory. To foster
efficient policy learning for complex STL tasks with multiple subgoals, we leverage
the learned critic to guide the high-level temporal search via
Metropolis-Hastings sampling, focusing exploration on temporally feasible
solutions. We conduct experiments on five environments, ranging from
low-dimensional navigation to manipulation, drone, and quadrupedal locomotion.
Under a wide range of STL tasks, TGPO significantly outperforms
state-of-the-art baselines (especially for high-dimensional and long-horizon
cases), with an average of 31.6% improvement in task success rate compared to
the best baseline. The code will be available at
https://github.com/mengyuest/TGPO
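To make the critic-guided temporal search concrete, the sketch below illustrates a Metropolis-Hastings search over subgoal time allocations in the spirit described above. It is not the authors' implementation: the horizon, number of subgoals, temperature, and the `critic_score` function are all placeholder assumptions; in TGPO the score would come from the learned critic of the time-conditioned policy.

```python
# Minimal illustrative sketch (not the authors' code): Metropolis-Hastings
# search over integer subgoal deadlines, scored by a stand-in "critic".
import math
import random

HORIZON = 20          # total task horizon (assumed)
NUM_SUBGOALS = 3      # number of timed subgoals after STL decomposition (assumed)
TEMPERATURE = 1.0     # temperature in the acceptance ratio (assumed)


def critic_score(allocation):
    """Placeholder critic: prefers evenly spaced deadlines.

    In TGPO this would be the learned critic evaluated on the initial state,
    conditioned on the candidate time allocation.
    """
    ideal = [HORIZON * (i + 1) / NUM_SUBGOALS for i in range(NUM_SUBGOALS)]
    return -sum((t - g) ** 2 for t, g in zip(allocation, ideal))


def random_allocation():
    """Sample an increasing sequence of subgoal deadlines within the horizon."""
    return sorted(random.sample(range(1, HORIZON + 1), NUM_SUBGOALS))


def propose(allocation):
    """Symmetric local proposal: perturb one deadline, keeping the order."""
    new = list(allocation)
    i = random.randrange(NUM_SUBGOALS)
    lo = new[i - 1] + 1 if i > 0 else 1
    hi = new[i + 1] - 1 if i < NUM_SUBGOALS - 1 else HORIZON
    new[i] = random.randint(lo, hi)
    return new


def metropolis_hastings_search(num_iters=500):
    """Search for a high-value (temporally feasible) time allocation."""
    current = random_allocation()
    current_score = critic_score(current)
    best, best_score = current, current_score
    for _ in range(num_iters):
        candidate = propose(current)
        candidate_score = critic_score(candidate)
        # Symmetric proposal, so accept with prob min(1, exp((s' - s) / T)).
        accept = math.exp(min(0.0, (candidate_score - current_score) / TEMPERATURE))
        if random.random() < accept:
            current, current_score = candidate, candidate_score
        if current_score > best_score:
            best, best_score = current, current_score
    return best, best_score


if __name__ == "__main__":
    allocation, score = metropolis_hastings_search()
    print("Best time allocation:", allocation, "score:", round(score, 2))
```

At inference time, several such allocations could be sampled and the highest-scoring one handed to the low-level policy for rollout, mirroring the selection step described in the abstract.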