TGPO: Temporal Grounded Policy Optimization for Signal Temporal Logic Tasks
September 30, 2025
Authors: Yue Meng, Fei Chen, Chuchu Fan
cs.AI
Abstract
Learning control policies for complex, long-horizon tasks is a central
challenge in robotics and autonomous systems. Signal Temporal Logic (STL)
offers a powerful and expressive language for specifying such tasks, but its
non-Markovian nature and inherently sparse reward make it difficult to solve
with standard Reinforcement Learning (RL) algorithms. Prior RL approaches focus
only on limited STL fragments or use STL robustness scores as sparse terminal
rewards. In this paper, we propose TGPO, Temporal Grounded Policy Optimization,
to solve general STL tasks. TGPO decomposes STL into timed subgoals and
invariant constraints and provides a hierarchical framework to tackle the
problem. The high-level component of TGPO proposes concrete time allocations
for these subgoals, and the low-level time-conditioned policy learns to achieve
the sequenced subgoals using a dense, stage-wise reward signal. During
inference, we sample various time allocations and select the most promising
assignment for the policy network to roll out the solution trajectory. To foster
efficient policy learning for complex STL tasks with multiple subgoals, we leverage
the learned critic to guide the high-level temporal search via
Metropolis-Hastings sampling, focusing exploration on temporally feasible
solutions. We conduct experiments on five environments, ranging from
low-dimensional navigation to manipulation, drone, and quadrupedal locomotion.
Under a wide range of STL tasks, TGPO significantly outperforms
state-of-the-art baselines (especially for high-dimensional and long-horizon
cases), with an average of 31.6% improvement in task success rate compared to
the best baseline. The code will be available at
https://github.com/mengyuest/TGPO
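
As an illustration of the high-level temporal search described in the abstract, the sketch below shows a generic critic-guided Metropolis-Hastings loop over monotone subgoal deadlines. This is a minimal sketch under stated assumptions: the function names, the toy hand-written critic, and the proposal scheme are invented for exposition and do not reflect TGPO's actual implementation, which uses a learned critic and its own proposal distribution.

```python
import numpy as np

# Stand-in for the learned critic: scores a candidate time allocation for the
# subgoal sequence from the current state. Purely illustrative; in TGPO this
# role is played by a learned value network, not a hand-written heuristic.
def critic_value(state, allocation):
    gaps = np.diff(np.concatenate(([0], allocation)))
    # Penalize allocations that leave fewer than 5 steps between deadlines.
    return -np.sum(np.maximum(0.0, 5.0 - gaps))

def propose(allocation, horizon, rng):
    """Perturb one subgoal deadline while keeping deadlines strictly ordered."""
    new = allocation.copy()
    i = rng.integers(len(new))
    lo = new[i - 1] + 1 if i > 0 else 1
    hi = new[i + 1] - 1 if i + 1 < len(new) else horizon
    if lo <= hi:
        new[i] = rng.integers(lo, hi + 1)
    return new

def mh_time_allocation(state, num_subgoals, horizon, steps=500, temp=1.0, seed=0):
    """Metropolis-Hastings search over time allocations, scored by the critic."""
    rng = np.random.default_rng(seed)
    alloc = np.sort(rng.choice(np.arange(1, horizon + 1),
                               size=num_subgoals, replace=False))
    score = critic_value(state, alloc)
    best_alloc, best_score = alloc, score
    for _ in range(steps):
        cand = propose(alloc, horizon, rng)
        cand_score = critic_value(state, cand)
        # Accept with the usual MH ratio under a Boltzmann target exp(score / temp).
        if np.log(rng.random()) < (cand_score - score) / temp:
            alloc, score = cand, cand_score
            if score > best_score:
                best_alloc, best_score = alloc, score
    return best_alloc, best_score

if __name__ == "__main__":
    state = np.zeros(4)  # placeholder initial state
    alloc, score = mh_time_allocation(state, num_subgoals=3, horizon=50)
    print("proposed deadlines per subgoal:", alloc, "critic score:", score)
```

Per the abstract, the best-scoring allocation would then condition the low-level time-conditioned policy to roll out a trajectory; the sketch above only mimics the search step with a toy score.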