
Safe Flow Q-Learning: Offline Safe Reinforcement Learning with Reachability-Based Flow Policies

March 16, 2026
Authors: Mumuksh Tayal, Manan Tayal, Ravi Prakash
cs.AI

Abstract

Offline safe reinforcement learning (RL) seeks reward-maximizing policies from static datasets under strict safety constraints. Existing methods often rely on soft expected-cost objectives or iterative generative inference, which can be insufficient for safety-critical real-time control. We propose Safe Flow Q-Learning (SafeFQL), which extends flow Q-learning (FQL) to safe offline RL by combining a Hamilton-Jacobi reachability-inspired safety value function with an efficient one-step flow policy. SafeFQL learns the safety value via a self-consistency Bellman recursion, trains a flow policy by behavioral cloning, and distills it into a one-step actor that selects reward-maximizing safe actions without rejection sampling at deployment. To account for finite-data approximation error in the learned safety boundary, we add a conformal prediction calibration step that adjusts the safety threshold and provides finite-sample probabilistic safety coverage. Empirically, SafeFQL trades modestly higher offline training cost for substantially lower inference latency than diffusion-style safe generative baselines, which is advantageous for real-time safety-critical deployment. Across boat-navigation and Safety Gymnasium MuJoCo tasks, SafeFQL matches or exceeds prior offline safe RL performance while substantially reducing constraint violations.
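
For context, the reachability-style safety value alluded to above is typically defined through a discounted safety Bellman backup of the form

    V(s) = (1 - \gamma)\, h(s) + \gamma \min\{\, h(s),\ \max_{a} V(s') \,\}

where h(s) is a signed constraint margin (h(s) \ge 0 exactly when s satisfies the constraint) and s' is the successor state under action a. This is the standard self-consistency recursion from the Hamilton-Jacobi reachability RL literature, shown here for orientation; the paper's exact formulation may differ. A state is then treated as safe when V(s) \ge 0, and it is this zero threshold that the conformal calibration step mentioned above adjusts.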
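To make the actor-training recipe concrete, here is a condensed PyTorch sketch of the two objectives the abstract names: flow-matching behavioral cloning, followed by distillation of the multi-step flow into a one-step, Q-guided actor. The network architectures, the Euler integrator, the critic q_net(s, a), and the weight alpha_q are illustrative assumptions in the spirit of the general FQL recipe, not the paper's code; the safety-value filtering is omitted for brevity.

import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

class VelocityField(nn.Module):
    # v_theta(s, x, t): velocity of the action-generating flow (assumed MLP).
    def __init__(self, state_dim, act_dim):
        super().__init__()
        self.net = mlp(state_dim + act_dim + 1, act_dim)

    def forward(self, s, x, t):
        return self.net(torch.cat([s, x, t], dim=-1))

class OneStepActor(nn.Module):
    # a_omega(s, z): maps a state and one Gaussian draw straight to an action.
    def __init__(self, state_dim, act_dim):
        super().__init__()
        self.net = mlp(state_dim + act_dim, act_dim)

    def forward(self, s, z):
        return self.net(torch.cat([s, z], dim=-1))

def flow_bc_loss(vel, s, a):
    # Flow-matching behavioral cloning: regress the velocity field onto the
    # constant velocity of straight noise-to-action paths.
    x0 = torch.randn_like(a)          # noise endpoint
    t = torch.rand(a.shape[0], 1)     # random flow time in [0, 1]
    xt = (1 - t) * x0 + t * a         # point on the linear path
    return ((vel(s, xt, t) - (a - x0)) ** 2).mean()

def distill_loss(actor, vel, q_net, s, act_dim, n_steps=10, alpha_q=1.0):
    # Distillation: push the one-step actor toward the action obtained by
    # integrating the flow with Euler steps, plus a Q term for reward
    # maximization (q_net is an assumed critic taking (state, action)).
    z = torch.randn(s.shape[0], act_dim)
    x = z
    with torch.no_grad():
        for i in range(n_steps):
            t = torch.full((s.shape[0], 1), i / n_steps)
            x = x + (1.0 / n_steps) * vel(s, x, t)
    a_student = actor(s, z)
    return ((a_student - x) ** 2).mean() - alpha_q * q_net(s, a_student).mean()

At deployment only the one-step actor is evaluated, which is the source of the inference-latency advantage over iterative diffusion-style sampling claimed above.

The conformal calibration step can likewise be sketched in a few lines. The nonconformity score below (the learned safety value on held-out states known to violate the constraint) is one plausible instantiation chosen for illustration, as is the helper name; the paper's exact calibration procedure may differ.

import numpy as np

def calibrate_safety_threshold(v_hat_unsafe, alpha=0.05):
    # Split-conformal shift of the safe/unsafe decision boundary.
    # v_hat_unsafe: learned safety values V_hat(s) on held-out calibration
    # states that are known to violate the constraint.
    n = len(v_hat_unsafe)
    # Finite-sample-valid empirical quantile level: ceil((n+1)(1-alpha))/n.
    q = min(1.0, float(np.ceil((n + 1) * (1 - alpha))) / n)
    tau = float(np.quantile(v_hat_unsafe, q, method="higher"))
    # Deployment rule: declare a state safe only if V_hat(s) > tau rather
    # than V_hat(s) > 0; under exchangeability, a truly unsafe state is then
    # declared safe with probability at most alpha.
    return tau

Shifting the threshold this way is how a learned boundary with approximation error can still carry the finite-sample probabilistic coverage guarantee the abstract mentions.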