Complementary Reinforcement Learning
March 18, 2026
Authors: Dilxat Muhtar, Jiashun Liu, Wei Gao, Weixun Wang, Shaopan Xiong, Ju Huang, Siran Yang, Wenbo Su, Jiamang Wang, Ling Pan, Bo Zheng
cs.AI
Abstract
Reinforcement Learning (RL) has emerged as a powerful paradigm for training LLM-based agents, yet it remains limited by low sample efficiency, stemming not only from sparse outcome feedback but also from the agent's inability to leverage prior experience across episodes. While augmenting agents with historical experience offers a promising remedy, existing approaches suffer from a critical weakness: the experience distilled from history either is stored statically or fails to co-evolve with the improving actor, causing a progressive misalignment between the experience and the actor's evolving capability that diminishes its utility over the course of training. Inspired by complementary learning systems in neuroscience, we present Complementary RL to achieve seamless co-evolution of an experience extractor and a policy actor within the RL optimization loop. Specifically, the actor is optimized via sparse outcome-based rewards, while the experience extractor is optimized according to whether its distilled experiences demonstrably contribute to the actor's success, thereby evolving its experience management strategy in lockstep with the actor's growing capabilities. Empirically, Complementary RL outperforms outcome-based agentic RL baselines that do not learn from experience, achieving a 10% performance improvement in single-task scenarios and exhibiting robust scalability in multi-task settings. These results establish Complementary RL as a new paradigm for efficient experience-driven agent learning.
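The co-evolution loop described above can be sketched in miniature: the actor is updated from a sparse success/failure outcome, while the extractor is updated only in episodes where its distilled experience was used and thus can be credited for the outcome. Everything here (the scalar "skill"/"quality" parameters, the update sizes, the success dynamics) is an illustrative assumption, not the paper's implementation.

```python
import random

def train_complementary_rl(episodes=2000, seed=0):
    """Toy sketch of the Complementary RL co-evolution loop.

    Two scalar parameters stand in for the two learned components:
      - actor_skill: the actor's base probability of task success,
      - extractor_quality: the probability the extractor distills a
        genuinely useful experience for the current episode.
    Both start weak and are improved by their respective reward signals.
    """
    rng = random.Random(seed)
    actor_skill = 0.2
    extractor_quality = 0.2

    for _ in range(episodes):
        # Extractor proposes a distilled experience; with probability
        # extractor_quality it is actually useful to the actor.
        useful = rng.random() < extractor_quality

        # Useful experience raises the actor's success probability.
        p_success = min(1.0, actor_skill + (0.3 if useful else 0.0))
        success = rng.random() < p_success

        # Actor update: driven only by the sparse outcome reward.
        actor_skill += 0.002 if success else -0.0005
        actor_skill = min(max(actor_skill, 0.0), 0.7)

        # Extractor update: credited only when its experience was in
        # play, so its reward reflects whether the distilled experience
        # demonstrably contributed to the actor's success.
        if useful:
            extractor_quality += 0.005 if success else -0.002
            extractor_quality = min(max(extractor_quality, 0.0), 1.0)

    return actor_skill, extractor_quality
```

Because the extractor's reward is tied to the actor's current success rate, its notion of a "useful" experience tracks the actor as the actor improves, which is the misalignment the static-experience baselines cannot avoid.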