
FLAC: Maximum Entropy RL via Kinetic Energy Regularized Bridge Matching

February 13, 2026
Authors: Lei Lv, Yunfei Li, Yu Luo, Fuchun Sun, Xiao Ma
cs.AI

Abstract

Iterative generative policies, such as diffusion models and flow matching, offer superior expressivity for continuous control but complicate Maximum Entropy Reinforcement Learning because their action log-densities are not directly accessible. To address this, we propose Field Least-Energy Actor-Critic (FLAC), a likelihood-free framework that regulates policy stochasticity by penalizing the kinetic energy of the velocity field. Our key insight is to formulate policy optimization as a Generalized Schrödinger Bridge (GSB) problem relative to a high-entropy reference process (e.g., uniform). Under this view, the maximum-entropy principle emerges naturally as staying close to a high-entropy reference while optimizing return, without requiring explicit action densities. In this framework, kinetic energy serves as a physically grounded proxy for divergence from the reference: minimizing path-space energy bounds the deviation of the induced terminal action distribution. Building on this view, we derive an energy-regularized policy iteration scheme and a practical off-policy algorithm that automatically tunes the kinetic energy via a Lagrangian dual mechanism. Empirically, FLAC achieves superior or comparable performance on high-dimensional benchmarks relative to strong baselines, while avoiding explicit density estimation.
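To make the abstract's central idea concrete, the following is a minimal sketch of an energy-regularized objective of the kind described above, written in standard maximum-entropy RL notation. The symbols are illustrative assumptions rather than the paper's exact formulation: v_θ denotes the action-generating velocity field, x_τ the intermediate flow state, and α the kinetic-energy coefficient.

\[
\max_{\theta}\ \mathbb{E}_{\pi_\theta}\!\Big[\sum_{t}\gamma^{t}\, r(s_t, a_t)\Big]
\;-\;\alpha\,\mathbb{E}\!\Big[\int_{0}^{1}\tfrac{1}{2}\,\big\|v_\theta(x_\tau,\tau \mid s_t)\big\|^{2}\,d\tau\Big]
\]

The second term is the kinetic (path-space) energy of the flow that transports the high-entropy reference distribution to the action distribution; keeping it small keeps the induced actions close to the reference, playing the role that the entropy bonus plays in standard maximum-entropy RL. The abstract states that the coefficient is tuned automatically via a Lagrangian dual mechanism; below is a hypothetical SAC-style sketch of such an update, assuming the constraint "expected kinetic energy ≤ energy budget". All names (log_alpha, energy_budget, batch_kinetic_energy) are illustrative and not taken from the paper.

```python
# Hypothetical sketch of a Lagrangian dual update for the kinetic-energy coefficient,
# assuming the constraint E[kinetic energy] <= energy_budget. Illustrative only.
import torch

log_alpha = torch.zeros(1, requires_grad=True)   # dual variable, alpha = exp(log_alpha) >= 0
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)
energy_budget = 1.0                              # assumed target path-space energy

def update_alpha(batch_kinetic_energy: torch.Tensor) -> float:
    """Raise alpha when measured kinetic energy exceeds the budget, lower it otherwise."""
    violation = batch_kinetic_energy.detach().mean() - energy_budget
    alpha_loss = -(log_alpha.exp() * violation)   # gradient descent here == dual ascent on alpha
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()
```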