Safe Flow Q-Learning: Offline Safe Reinforcement Learning with Reachability-Based Flow Policies

Abstract

Offline safe reinforcement learning (RL) seeks reward-maximizing policies from static datasets under strict safety constraints. Existing methods often rely on soft expected-cost objectives or iterative generative inference, which can be insufficient for safety-critical real-time control. We propose Safe Flow Q-Learning (SafeFQL), which extends Flow Q-Learning (FQL) to safe offline RL by combining a Hamilton-Jacobi reachability-inspired safety value function with an efficient one-step flow policy. SafeFQL learns the safety value via a self-consistency Bellman recursion, trains a flow policy by behavioral cloning, and distills it into a one-step actor that selects reward-maximizing safe actions without rejection sampling at deployment. To account for finite-data approximation error in the learned safety boundary, we add a conformal prediction calibration step that adjusts the safety threshold and provides finite-sample probabilistic safety coverage. Empirically, SafeFQL trades modestly higher offline training cost for substantially lower inference latency than diffusion-style safe generative baselines, which is advantageous for real-time safety-critical deployment. Across boat navigation and Safety Gymnasium MuJoCo tasks, SafeFQL matches or exceeds prior offline safe RL performance while substantially reducing constraint violations.
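As an illustration of the calibration idea mentioned in the abstract, the following is a minimal sketch of generic one-sided split conformal calibration applied to a safety threshold. It is not taken from the paper and may differ from the SafeFQL procedure. The function names (`conformal_safety_margin`, `is_certified_safe`), the sign convention (a non-negative safety value means "predicted safe"), and the use of an observed safety margin from held-out rollouts as ground truth are all assumptions made for this example.

```python
import math

import numpy as np


def conformal_safety_margin(predicted, observed, alpha=0.1):
    """Return a margin q such that, under exchangeability,
    P(observed_new >= predicted_new - q) >= 1 - alpha.

    predicted: learned safety values on a held-out calibration set (assumed input).
    observed:  safety margins actually realized on those samples (assumed input).
    """
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    n = predicted.shape[0]

    # Nonconformity score: how much the model over-estimated safety.
    scores = predicted - observed

    # Standard split conformal rank with the finite-sample (n + 1) correction.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration samples for this alpha: no finite margin suffices.
        return float("inf")
    return float(np.sort(scores)[k - 1])


def is_certified_safe(value, margin):
    """Shift the nominal threshold (assumed to be 0) by the calibrated margin."""
    return value - margin >= 0.0


if __name__ == "__main__":
    # Hypothetical calibration data, for illustration only.
    pred = [1.0, 0.5, 0.2, 0.8]
    obs = [0.9, 0.6, 0.0, 0.5]
    q = conformal_safety_margin(pred, obs, alpha=0.5)
    print(q, is_certified_safe(0.5, q), is_certified_safe(0.1, q))
```

In this sketch a larger margin makes the safe set more conservative, and an infinite margin (returned when the calibration set is too small for the requested `alpha`) certifies nothing as safe.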
