Meta-Reinforcement Learning with Self-Reflection for Agentic Search
March 11, 2026
作者: Teng Xiao, Yige Yuan, Hamish Ivison, Huaisheng Zhu, Faeze Brahman, Nathan Lambert, Pradeep Dasigi, Noah A. Smith, Hannaneh Hajishirzi
cs.AI
Abstract
This paper introduces MR-Search, an in-context meta-reinforcement learning (RL) formulation for agentic search with self-reflection. Instead of optimizing a policy within a single independent episode with sparse rewards, MR-Search trains a policy that conditions on past episodes and adapts its search strategy across them. MR-Search learns to learn a search strategy through self-reflection, allowing search agents to improve in-context exploration at test time. Specifically, MR-Search performs cross-episode exploration by generating an explicit self-reflection after each episode and leveraging it as additional context to guide subsequent attempts, thereby promoting more effective exploration at test time. We further introduce a multi-turn RL algorithm that estimates a dense relative advantage at the turn level, enabling fine-grained credit assignment within each episode. Empirical results on multiple benchmarks demonstrate the advantages of MR-Search over RL-based baselines, showing strong generalization and relative improvements of 9.2% to 19.3% across eight benchmarks. Our code and data are available at https://github.com/tengxiao1/MR-Search.
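The two mechanisms described above can be sketched in a minimal, hypothetical form. The code below is an illustrative assumption, not the paper's implementation: `reflect`, `run_episode`, and the choice of group-normalized per-turn rewards as the "dense relative advantage" are all stand-ins for details given only in the full paper.

```python
import statistics

def reflect(question, trajectory, reward):
    # Hypothetical helper: in MR-Search the policy model itself generates
    # a textual self-reflection; here we return a stub string.
    return f"Reflection on '{question}': attempt {trajectory!r} scored {reward}"

def run_episode(policy, question, context):
    # Hypothetical stand-in for one multi-turn search episode rollout.
    trajectory = policy(question, context)
    reward = 1.0 if "answer" in trajectory else 0.0  # sparse outcome reward
    return trajectory, reward

def cross_episode_search(policy, question, num_episodes=3):
    """Cross-episode exploration: each episode's self-reflection is
    appended to the context that conditions the next attempt."""
    context, history = [], []
    for _ in range(num_episodes):
        traj, reward = run_episode(policy, question, context)
        history.append((traj, reward))
        context.append(reflect(question, traj, reward))
    return history

def turn_level_advantages(turn_rewards_per_rollout):
    """One plausible reading of a turn-level dense relative advantage:
    normalize each turn's reward against the group of parallel rollouts
    at the same turn index (all rollouts assumed equal length here)."""
    num_turns = len(turn_rewards_per_rollout[0])
    advantages = []
    for traj in turn_rewards_per_rollout:
        adv = []
        for t in range(num_turns):
            group = [r[t] for r in turn_rewards_per_rollout]
            mu = statistics.mean(group)
            sigma = statistics.pstdev(group) or 1.0  # avoid division by zero
            adv.append((traj[t] - mu) / sigma)
        advantages.append(adv)
    return advantages
```

With a toy policy that only succeeds once a reflection is in context, the first episode fails and later ones succeed, showing how the appended reflections change behavior across episodes without any weight update.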