Meta-Reinforcement Learning with Self-Reflection for Agentic Search
March 11, 2026
作者: Teng Xiao, Yige Yuan, Hamish Ivison, Huaisheng Zhu, Faeze Brahman, Nathan Lambert, Pradeep Dasigi, Noah A. Smith, Hannaneh Hajishirzi
cs.AI
Abstract
This paper introduces MR-Search, an in-context meta-reinforcement learning (RL) formulation for agentic search with self-reflection. Instead of optimizing a policy within a single independent episode with sparse rewards, MR-Search trains a policy that conditions on past episodes and adapts its search strategy across them. MR-Search learns to learn a search strategy through self-reflection, allowing search agents to improve in-context exploration at test time. Specifically, MR-Search performs cross-episode exploration by generating an explicit self-reflection after each episode and leveraging it as additional context to guide subsequent attempts, thereby promoting more effective exploration at test time. We further introduce a multi-turn RL algorithm that estimates a dense relative advantage at the turn level, enabling fine-grained credit assignment within each episode. Empirical results on multiple benchmarks demonstrate the advantages of MR-Search over RL-based baselines, showing strong generalization and relative improvements of 9.2% to 19.3% across eight benchmarks. Our code and data are available at https://github.com/tengxiao1/MR-Search.
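The two mechanisms described above can be sketched in a minimal, hypothetical form. The code below is an illustrative assumption, not the paper's implementation: `reflect`, `run_episode`, and the choice of group-normalized per-turn rewards as the "dense relative advantage" are all stand-ins for details given only in the full paper.

```python
import statistics

def reflect(question, trajectory, reward):
    # Hypothetical helper: in MR-Search the policy model itself generates
    # a textual self-reflection; here we return a stub string.
    return f"Reflection on '{question}': attempt {trajectory!r} scored {reward}"

def run_episode(policy, question, context):
    # Hypothetical stand-in for one multi-turn search episode rollout.
    trajectory = policy(question, context)
    reward = 1.0 if "answer" in trajectory else 0.0  # sparse outcome reward
    return trajectory, reward

def cross_episode_search(policy, question, num_episodes=3):
    """Cross-episode exploration: each episode's self-reflection is
    appended to the context that conditions the next attempt."""
    context, history = [], []
    for _ in range(num_episodes):
        traj, reward = run_episode(policy, question, context)
        history.append((traj, reward))
        context.append(reflect(question, traj, reward))
    return history

def turn_level_advantages(turn_rewards_per_rollout):
    """One plausible reading of a turn-level dense relative advantage:
    normalize each turn's reward against the group of parallel rollouts
    at the same turn index (all rollouts assumed equal length here)."""
    num_turns = len(turn_rewards_per_rollout[0])
    advantages = []
    for traj in turn_rewards_per_rollout:
        adv = []
        for t in range(num_turns):
            group = [r[t] for r in turn_rewards_per_rollout]
            mu = statistics.mean(group)
            sigma = statistics.pstdev(group) or 1.0  # avoid division by zero
            adv.append((traj[t] - mu) / sigma)
        advantages.append(adv)
    return advantages
```

With a toy policy that only succeeds once a reflection is in context, the first episode fails and later ones succeed, showing how the appended reflections change behavior across episodes without any weight update.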