TGPO: Temporal Grounded Policy Optimization for Signal Temporal Logic Tasks

Abstract

Learning control policies for complex, long-horizon tasks is a central challenge in robotics and autonomous systems. Signal Temporal Logic (STL) offers a powerful and expressive language for specifying such tasks, but its non-Markovian nature and inherently sparse rewards make these tasks difficult to solve with standard Reinforcement Learning (RL) algorithms. Prior RL approaches either focus on limited STL fragments or use STL robustness scores as sparse terminal rewards. In this paper, we propose TGPO, Temporal Grounded Policy Optimization, to solve general STL tasks. TGPO decomposes an STL specification into timed subgoals and invariant constraints and provides a hierarchical framework to tackle the problem. The high-level component of TGPO proposes concrete time allocations for these subgoals, and the low-level time-conditioned policy learns to achieve the sequenced subgoals using a dense, stage-wise reward signal. During inference, we sample various time allocations and select the most promising assignment for the policy network to roll out the solution trajectory. To foster efficient policy learning for complex STL specifications with multiple subgoals, we leverage the learned critic to guide the high-level temporal search via Metropolis-Hastings sampling, focusing exploration on temporally feasible solutions. We conduct experiments on five environments, ranging from low-dimensional navigation to manipulation, drone, and quadrupedal locomotion. Across a wide range of STL tasks, TGPO significantly outperforms state-of-the-art baselines (especially in high-dimensional and long-horizon cases), with an average improvement of 31.6% in task success rate over the best baseline. The code will be available at https://github.com/mengyuest/TGPO.
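The critic-guided temporal search mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_critic`, `propose`, `is_valid`, `mh_search`, the minimum-gap rule, and every parameter value are hypothetical stand-ins chosen for illustration. The sketch assumes a time allocation is a list of strictly increasing subgoal deadlines and that a critic returns a scalar score (higher is better). It then runs plain Metropolis-Hastings with a symmetric random-walk proposal, targeting a density proportional to exp(score / temperature).

```python
import math
import random


def toy_critic(times, min_gap=10):
    """Hypothetical stand-in for a learned critic (higher is better).

    Penalizes allocations whose consecutive deadlines are closer than
    `min_gap` steps, as a crude proxy for temporal infeasibility.
    """
    score = 0.0
    prev = 0
    for t in times:
        score -= max(0, min_gap - (t - prev))
        prev = t
    return score


def propose(times, rng, step=5):
    """Symmetric random-walk proposal: shift one deadline by up to `step`."""
    new = list(times)
    i = rng.randrange(len(new))
    new[i] += rng.randint(-step, step)
    return new


def is_valid(times, horizon):
    """Deadlines must lie in [1, horizon] and be strictly increasing."""
    in_range = all(1 <= t <= horizon for t in times)
    ordered = all(a < b for a, b in zip(times, times[1:]))
    return in_range and ordered


def mh_search(critic, init, horizon, n_steps=2000, temperature=1.0, seed=0):
    """Metropolis-Hastings over time allocations; returns the best one seen."""
    rng = random.Random(seed)
    current = list(init)
    cur_score = critic(current)
    best, best_score = current, cur_score
    for _ in range(n_steps):
        cand = propose(current, rng)
        if not is_valid(cand, horizon):
            continue  # target density is zero here, so the move is rejected
        cand_score = critic(cand)
        delta = (cand_score - cur_score) / temperature
        # Accept with probability min(1, exp(delta)).
        if delta >= 0 or rng.random() < math.exp(delta):
            current, cur_score = cand, cand_score
            if cur_score > best_score:
                best, best_score = current, cur_score
    return best, best_score


# Example: three subgoals, starting from an infeasibly tight allocation.
best_alloc, best_val = mh_search(toy_critic, init=[2, 4, 6], horizon=100)
```

In this toy setting the selected allocation would then be passed to a time-conditioned policy for rollout. Invalid proposals are rejected rather than clipped so that the proposal stays symmetric and the acceptance rule remains a valid Metropolis-Hastings step.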
