
Complementary Reinforcement Learning

March 18, 2026
作者: Dilxat Muhtar, Jiashun Liu, Wei Gao, Weixun Wang, Shaopan Xiong, Ju Huang, Siran Yang, Wenbo Su, Jiamang Wang, Ling Pan, Bo Zheng
cs.AI

Abstract

Reinforcement Learning (RL) has emerged as a powerful paradigm for training LLM-based agents, yet it remains limited by low sample efficiency, stemming not only from sparse outcome feedback but also from the agent's inability to leverage prior experience across episodes. While augmenting agents with historical experience offers a promising remedy, existing approaches suffer from a critical weakness: the experience distilled from history is either stored statically or fails to co-evolve with the improving actor, causing a progressive misalignment between the experience and the actor's evolving capability that diminishes its utility over the course of training. Inspired by complementary learning systems in neuroscience, we present Complementary RL to achieve seamless co-evolution of an experience extractor and a policy actor within the RL optimization loop. Specifically, the actor is optimized via sparse outcome-based rewards, while the experience extractor is optimized according to whether its distilled experiences demonstrably contribute to the actor's success, thereby evolving its experience management strategy in lockstep with the actor's growing capabilities. Empirically, Complementary RL outperforms outcome-based agentic RL baselines that do not learn from experience, achieving a 10% performance improvement in single-task scenarios and exhibiting robust scalability in multi-task settings. These results establish Complementary RL as a paradigm for efficient experience-driven agent learning.
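The co-evolution loop described above can be sketched as a toy simulation: the actor receives only a sparse success/failure reward, while the extractor is credited by whether the experience it supplied demonstrably helped (here approximated by comparing an episode with the experience against a counterfactual rollout without it). All names and update rules below (`Actor`, `ExperienceExtractor`, the advantage-based credit) are illustrative assumptions, not the paper's actual algorithm.

```python
import random

class Actor:
    """Toy policy whose success probability grows with its own skill
    plus whatever distilled experience it is conditioned on."""
    def __init__(self, skill=0.2):
        self.skill = skill

    def act(self, experience_bonus, rng):
        # Binary outcome: 1.0 on success, 0.0 on failure.
        p = min(1.0, self.skill + experience_bonus)
        return 1.0 if rng.random() < p else 0.0

    def update(self, reward, lr=0.01):
        # Sparse outcome-based update: reinforce only on success.
        self.skill = min(1.0, self.skill + lr * reward)

class ExperienceExtractor:
    """Distills past episodes into a helpfulness bonus and is updated
    by whether that bonus actually improved the actor's outcome."""
    def __init__(self):
        self.quality = 0.0

    def distill(self):
        return self.quality

    def update(self, reward_with, reward_without, lr=0.05):
        # Credit the extractor only for demonstrable contribution:
        # the outcome gap between episodes with vs. without experience.
        advantage = reward_with - reward_without
        self.quality = max(0.0, min(0.5, self.quality + lr * advantage))

def train(steps=2000, seed=0):
    rng = random.Random(seed)
    actor, extractor = Actor(), ExperienceExtractor()
    for _ in range(steps):
        bonus = extractor.distill()
        r_with = actor.act(bonus, rng)       # episode with experience
        r_without = actor.act(0.0, rng)      # counterfactual rollout
        actor.update(r_with)                 # sparse outcome reward
        extractor.update(r_with, r_without)  # co-evolving credit
    return actor.skill, extractor.quality

final_skill, final_quality = train()
print(final_skill, final_quality)
```

Because both components are updated inside the same loop, the extractor's notion of "useful experience" tracks the actor as it improves, which is the misalignment the abstract argues static experience stores cannot avoid.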