Meta-Reinforcement Learning with Self-Reflection for Agentic Search
March 11, 2026
Authors: Teng Xiao, Yige Yuan, Hamish Ivison, Huaisheng Zhu, Faeze Brahman, Nathan Lambert, Pradeep Dasigi, Noah A. Smith, Hannaneh Hajishirzi
cs.AI
Abstract
This paper introduces MR-Search, an in-context meta-reinforcement learning (RL) framework for agentic search with self-reflection. Instead of optimizing a policy within single, independent episodes with sparse rewards, MR-Search trains a policy that conditions on past episodes and adapts its search strategy across episodes. MR-Search learns to learn a search strategy through self-reflection, allowing search agents to improve in-context exploration at test time. Specifically, MR-Search performs cross-episode exploration by generating an explicit self-reflection after each episode and leveraging it as additional context to guide subsequent attempts, thereby promoting more effective exploration at test time. We further introduce a multi-turn RL algorithm that estimates a dense relative advantage at the turn level, enabling fine-grained credit assignment within each episode. Empirical results demonstrate the advantages of MR-Search over RL-based baselines, showing strong generalization and relative improvements of 9.2% to 19.3% across eight benchmarks. Our code and data are available at https://github.com/tengxiao1/MR-Search.
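The two mechanisms described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`cross_episode_search`, `turn_level_relative_advantage`), the group-normalization form of the relative advantage, and the `policy`/`reflect` interfaces are all assumptions made for the sketch.

```python
import statistics
from typing import Callable, List, Tuple

def turn_level_relative_advantage(turn_rewards: List[List[float]]) -> List[List[float]]:
    """Hypothetical dense turn-level advantage: at each turn index, normalize
    rewards across a group of sampled episodes (subtract group mean, divide by
    group std), giving per-turn rather than per-episode credit assignment."""
    n_turns = min(len(ep) for ep in turn_rewards)
    advantages = [[0.0] * n_turns for _ in turn_rewards]
    for t in range(n_turns):
        group = [ep[t] for ep in turn_rewards]
        mean = statistics.fmean(group)
        std = statistics.pstdev(group) or 1.0  # guard against zero variance
        for i, ep in enumerate(turn_rewards):
            advantages[i][t] = (ep[t] - mean) / std
    return advantages

def cross_episode_search(
    task: str,
    policy: Callable[[str, List[str]], Tuple[str, float]],
    reflect: Callable[[str, str, float], str],
    n_episodes: int = 3,
) -> List[Tuple[str, float, str]]:
    """Cross-episode exploration loop: after each episode, generate an explicit
    self-reflection and append it to the context that conditions the next attempt."""
    context: List[str] = []
    trace = []
    for _ in range(n_episodes):
        answer, reward = policy(task, context)      # one search attempt, conditioned on past reflections
        reflection = reflect(task, answer, reward)  # explicit self-reflection on this episode
        context.append(reflection)                  # carried forward to subsequent episodes
        trace.append((answer, reward, reflection))
    return trace
```

With two episodes of per-turn rewards `[[1.0, 0.0], [0.0, 1.0]]`, the group-normalized advantages come out as `[[1.0, -1.0], [-1.0, 1.0]]`: each turn is credited relative to the same turn in the other sampled episodes, rather than sharing one sparse episode-level signal.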