Differentiable Evolutionary Reinforcement Learning
December 15, 2025
Authors: Sitao Cheng, Tianle Li, Xuhan Huang, Xunjian Yin, Difan Zou
cs.AI
Abstract
The design of effective reward functions presents a central and often arduous challenge in reinforcement learning (RL), particularly when developing autonomous agents for complex reasoning tasks. While automated reward optimization approaches exist, they typically rely on derivative-free evolutionary heuristics that treat the reward function as a black box, failing to capture the causal relationship between reward structure and task performance. To bridge this gap, we propose Differentiable Evolutionary Reinforcement Learning (DERL), a bilevel framework that enables the autonomous discovery of optimal reward signals. In DERL, a Meta-Optimizer evolves a reward function (i.e., a Meta-Reward) by composing structured atomic primitives, guiding the training of an inner-loop policy. Crucially, unlike prior evolutionary approaches, DERL is differentiable in its meta-optimization: it treats the inner-loop validation performance as a signal to update the Meta-Optimizer via reinforcement learning. This allows DERL to approximate the "meta-gradient" of task success, progressively learning to generate denser and more actionable feedback. We validate DERL across three distinct domains: robotic agents (ALFWorld), scientific simulation (ScienceWorld), and mathematical reasoning (GSM8k, MATH). Experimental results show that DERL achieves state-of-the-art performance on ALFWorld and ScienceWorld, significantly outperforming methods relying on heuristic rewards, especially in out-of-distribution scenarios. Analysis of the evolutionary trajectory demonstrates that DERL successfully captures the intrinsic structure of tasks, enabling self-improving agent alignment without human intervention.
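To make the bilevel structure concrete, here is a minimal, self-contained Python sketch of the loop the abstract describes: an outer Meta-Optimizer samples a reward function composed from atomic primitives, an inner loop trains a policy under that reward, and the inner-loop validation score is fed back as a REINFORCE-style signal to the outer sampler. The primitives, the toy task, and all hyperparameters below are invented for illustration; this is not the authors' implementation.

```python
# Illustrative sketch only -- not the paper's code. It mimics the bilevel idea:
# the outer "Meta-Optimizer" samples a composed Meta-Reward, the inner loop
# trains a toy policy under it, and validation success updates the outer
# sampler via a simple policy-gradient (REINFORCE-style) estimate.
import math
import random

random.seed(0)

# --- atomic reward primitives (hypothetical) --------------------------------
# Each maps (progress, success) from a rollout to a scalar shaping term.
PRIMITIVES = {
    "sparse":   lambda progress, success: 1.0 if success else 0.0,
    "progress": lambda progress, success: progress,
    "penalty":  lambda progress, success: -0.1 * (1.0 - progress),
}

def compose_reward(weights):
    """Meta-Reward = weighted combination of atomic primitives."""
    def reward(progress, success):
        return sum(w * PRIMITIVES[name](progress, success)
                   for name, w in weights.items())
    return reward

# --- toy inner loop ----------------------------------------------------------
def inner_train_and_validate(reward_fn, steps=200):
    """Train a one-parameter policy with the composed reward; return validation success rate."""
    theta = 0.0                               # P(correct action) = sigmoid(theta)
    for _ in range(steps):
        p = 1.0 / (1.0 + math.exp(-theta))
        success = random.random() < p
        progress = p
        r = reward_fn(progress, success)
        # crude policy-gradient step on the toy policy
        theta += 0.1 * r * ((1.0 if success else 0.0) - p)
    # validation measures only sparse task success (what we actually care about)
    p = 1.0 / (1.0 + math.exp(-theta))
    return sum(random.random() < p for _ in range(100)) / 100.0

# --- meta-optimization: REINFORCE over primitive weights ---------------------
logits = {name: 0.0 for name in PRIMITIVES}   # Meta-Optimizer parameters
baseline = 0.0
for generation in range(20):
    # sample a candidate weighting of primitives (the "evolved" Meta-Reward)
    weights = {n: math.exp(l + 0.3 * random.gauss(0, 1)) for n, l in logits.items()}
    score = inner_train_and_validate(compose_reward(weights))
    # validation performance acts as the meta-reward; nudge logits toward
    # weightings whose inner-loop policies validated better
    advantage = score - baseline
    baseline = 0.9 * baseline + 0.1 * score
    for n in logits:
        logits[n] += 0.5 * advantage * (math.log(weights[n]) - logits[n])
    print(f"gen {generation:02d}  val success {score:.2f}  "
          + ", ".join(f"{n}={w:.2f}" for n, w in weights.items()))
```

The key design choice this sketch highlights is that the outer update uses the inner-loop validation score as a learning signal rather than treating each candidate reward as a black-box individual in a population, which is how the abstract distinguishes DERL from derivative-free evolutionary heuristics.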