Meta-Reinforcement Learning with Self-Reflection for Agentic Search

Abstract

This paper introduces MR-Search, an in-context meta-reinforcement learning (meta-RL) formulation for agentic search with self-reflection. Instead of optimizing a policy within a single independent episode with sparse rewards, MR-Search trains a policy that conditions on past episodes and adapts its search strategy across episodes. By learning to learn a search strategy through self-reflection, MR-Search allows search agents to improve their in-context exploration at test time. Specifically, MR-Search performs cross-episode exploration: after each episode it generates an explicit self-reflection and uses it as additional context to guide subsequent attempts, which promotes more effective exploration at test time. We further introduce a multi-turn RL algorithm that estimates a dense relative advantage at the turn level, enabling fine-grained credit assignment for each episode. Empirical results across various benchmarks demonstrate the advantages of MR-Search over RL-based baselines, showing strong generalization and relative improvements of 9.2% to 19.3% across eight benchmarks. Our code and data are available at https://github.com/tengxiao1/MR-Search.
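The cross-episode exploration described above can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the MR-Search implementation: the function name `cross_episode_search`, the callables `run_episode` and `reflect`, the `[attempt]` / `[reflection]` tags, and the default of three episodes are all hypothetical and do not come from the paper or its repository.

```python
from typing import Callable, List, Tuple


def cross_episode_search(
    question: str,
    run_episode: Callable[[str], str],
    reflect: Callable[[str, str], str],
    num_episodes: int = 3,
) -> Tuple[List[str], str]:
    """Run several search episodes, feeding each self-reflection into the next.

    All names here are hypothetical placeholders:
    - run_episode(context) stands in for one full search episode and
      returns the answer produced in that episode.
    - reflect(context, answer) stands in for the model's explicit
      self-reflection on that episode.
    """
    context = question
    answers: List[str] = []
    for _ in range(num_episodes):
        # One attempt, conditioned on everything accumulated so far.
        answer = run_episode(context)
        answers.append(answer)
        # Explicit self-reflection on the attempt that just finished.
        reflection = reflect(context, answer)
        # The attempt and its reflection become context for the next episode.
        context = f"{context}\n[attempt] {answer}\n[reflection] {reflection}"
    return answers, context
```

In this sketch the policy is never updated between episodes; any improvement across attempts comes only from the growing context, which is the sense in which the adaptation is in-context.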
