Meta-Reinforcement Learning with Self-Reflection for Agentic Search

Abstract

This paper introduces MR-Search, an in-context meta reinforcement learning (RL) formulation for agentic search with self-reflection. Instead of optimizing a policy within a single independent episode with sparse rewards, MR-Search trains a policy that conditions on past episodes and adapts its search strategy across episodes. In this sense MR-Search learns to learn a search strategy through self-reflection, allowing search agents to improve their in-context exploration at test time. Specifically, MR-Search performs cross-episode exploration by generating an explicit self-reflection after each episode and using it as additional context to guide subsequent attempts, which promotes more effective exploration at test time. We further introduce a multi-turn RL algorithm that estimates a dense relative advantage at the turn level, enabling fine-grained credit assignment within each episode. Empirical results across eight benchmarks demonstrate the advantages of MR-Search over RL-based baselines, showing strong generalization and relative improvements of 9.2% to 19.3%. Our code and data are available at https://github.com/tengxiao1/MR-Search.
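The cross-episode loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Episode`, `run_meta_episode`, `toy_attempt`, `toy_reflect`, and `toy_score` are hypothetical, the agent and the reflection step are stubbed with toy functions rather than a language model with search tools, and the turn-level advantage estimation used for training is not covered. The sketch only shows how each attempt can condition on earlier attempts and their reflections.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Episode:
    """One attempt at the question (hypothetical container, not from the paper)."""
    answer: str
    reward: float      # assumed to be available only during training
    reflection: str    # self-generated text carried into later attempts


def run_meta_episode(
    question: str,
    attempt: Callable[[str, List[Episode]], str],
    score: Callable[[str], float],
    reflect: Callable[[str, str, List[Episode]], str],
    num_episodes: int = 3,
) -> List[Episode]:
    """Run several attempts, each conditioned on earlier attempts and reflections."""
    history: List[Episode] = []
    for _ in range(num_episodes):
        # The agent sees the question plus every earlier episode as context.
        answer = attempt(question, history)
        # The reflection is produced from the attempt itself, not from the reward.
        reflection = reflect(question, answer, history)
        history.append(Episode(answer, score(answer), reflection))
    return history


# Toy stand-ins (assumptions for illustration only).
def toy_attempt(question: str, history: List[Episode]) -> str:
    candidates = ["Lyon", "Marseille", "Paris"]
    tried = {ep.answer for ep in history}
    # Pick the first candidate that earlier episodes have not already tried.
    return next(c for c in candidates if c not in tried)


def toy_reflect(question: str, answer: str, history: List[Episode]) -> str:
    return f"Tried '{answer}'; consider a different candidate next time."


def toy_score(answer: str) -> float:
    return 1.0 if answer == "Paris" else 0.0


if __name__ == "__main__":
    episodes = run_meta_episode(
        "What is the capital of France?", toy_attempt, toy_score, toy_reflect
    )
    for ep in episodes:
        print(ep.answer, ep.reward, ep.reflection)
```

In this toy setting the history of earlier answers is what lets later attempts avoid repeating a failed guess; in the setting the abstract describes, that role is played by the generated self-reflections supplied as additional context.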
