Safe Flow Q-Learning: Offline Safe Reinforcement Learning with Reachability-Based Flow Policies
March 16, 2026
Authors: Mumuksh Tayal, Manan Tayal, Ravi Prakash
cs.AI
Abstract
Offline safe reinforcement learning (RL) seeks reward-maximizing policies from static datasets under strict safety constraints. Existing methods often rely on soft expected-cost objectives or iterative generative inference, which can be insufficient for safety-critical real-time control. We propose Safe Flow Q-Learning (SafeFQL), which extends flow Q-learning (FQL) to safe offline RL by combining a Hamilton-Jacobi reachability-inspired safety value function with an efficient one-step flow policy. SafeFQL learns the safety value via a self-consistency Bellman recursion, trains a flow policy by behavioral cloning, and distills it into a one-step actor for reward-maximizing safe action selection without rejection sampling at deployment. To account for finite-data approximation error in the learned safety boundary, we add a conformal prediction calibration step that adjusts the safety threshold and provides finite-sample probabilistic safety coverage. Empirically, SafeFQL trades modestly higher offline training cost for substantially lower inference latency than diffusion-style safe generative baselines, which is advantageous for real-time safety-critical deployment. Across boat navigation and Safety Gymnasium MuJoCo tasks, SafeFQL matches or exceeds prior offline safe RL performance while substantially reducing constraint violations.
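The abstract does not spell out the self-consistency Bellman recursion; a standard discounted HJ-reachability-style backup consistent with the description, with $h(s)$ a signed safety margin that is positive inside the constraint-satisfying set (the exact form used in the paper may differ), would read:

$$V(s) \;=\; (1-\gamma)\,h(s) \;+\; \gamma \min\Bigl\{\, h(s),\; \max_{a \in \mathcal{A}} V(s') \,\Bigr\},$$

where $s'$ is the successor state under action $a$. Under this recursion, $V(s) \ge 0$ marks states from which some policy can keep the margin nonnegative, and deployment-time safe action selection reduces to maximizing reward over actions whose successors satisfy $V(s') \ge \tau$ for a (calibrated) threshold $\tau$.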
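The abstract likewise leaves the conformal calibration rule implicit; below is a minimal split-conformal sketch of how such a threshold adjustment could work, assuming access to a held-out calibration set of predicted safety values and realized safety margins. The function and variable names (calibrate_safety_threshold, v_pred, v_true) are illustrative, not from the paper.

```python
import numpy as np

def calibrate_safety_threshold(v_pred, v_true, alpha=0.05):
    """Split-conformal shift for a learned safety value function.

    v_pred: predicted safety values V_hat(s) on held-out calibration states
    v_true: realized signed safety margins for those states (e.g., the
            minimum margin h(s) observed along the ensuing trajectory)
    alpha:  tolerated miscoverage level

    Returns tau such that declaring a state safe only when
    V_hat(s) >= tau keeps the realized margin nonnegative with
    probability at least 1 - alpha (marginal, finite-sample).
    """
    # Nonconformity score: how much the model over-estimated safety.
    scores = np.asarray(v_pred) - np.asarray(v_true)
    n = scores.shape[0]
    # Finite-sample-corrected quantile level of split conformal prediction.
    q_level = min(1.0, np.ceil((n + 1) * (1.0 - alpha)) / n)
    return np.quantile(scores, q_level, method="higher")
```

With tau in hand, the deployment-time check V_hat(s) >= tau inherits the usual split-conformal guarantee: for an exchangeable test state, the realized margin satisfies v_true >= V_hat(s) - tau >= 0 with probability at least 1 - alpha, matching the abstract's finite-sample probabilistic safety coverage claim.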