MaskSearch: A Universal Pre-Training Framework to Enhance Agentic Search Capability

Abstract

Retrieval-Augmented Language Models (RALMs) represent a classic paradigm in which models enhance their generative capabilities with external knowledge retrieved by a specialized module. Recent advances in agent techniques enable Large Language Models (LLMs) to autonomously use tools for retrieval, planning, and reasoning. Although existing training-based methods show promise, the agentic abilities they produce are limited by the inherent characteristics of the task-specific data used during training. To further enhance the universal search capability of agents, we propose a novel pre-training framework, MaskSearch. In the pre-training stage, we introduce the Retrieval Augmented Mask Prediction (RAMP) task, in which the model learns to use search tools to fill masked spans in large-scale pre-training data, so that the LLM acquires universal retrieval and reasoning capabilities. The model is then trained on downstream tasks to achieve further improvement. We apply both Supervised Fine-tuning (SFT) and Reinforcement Learning (RL) for training. For SFT, we combine agent-based and distillation-based methods to generate training data, starting with a multi-agent system consisting of a planner, a rewriter, and an observer, followed by a self-evolving teacher model. For RL, we employ DAPO as the training framework and adopt a hybrid reward system consisting of answer rewards and format rewards. Additionally, we introduce a curriculum learning approach that lets the model learn progressively from easier to more challenging instances, ordered by the number of masked spans. We evaluate the effectiveness of our framework on open-domain multi-hop question answering. Through extensive experiments, we demonstrate that MaskSearch significantly enhances the performance of LLM-based search agents on both in-domain and out-of-domain downstream tasks.
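As a rough illustration of the RAMP task and the curriculum described in the abstract, the following Python sketch builds a masked instance from a passage and orders instances by their number of masked spans. It is a minimal sketch under stated assumptions, not the authors' implementation: the `[mask]` placeholder, the `RampInstance` type, and the helper names `make_ramp_instance` and `curriculum_order` are all hypothetical, and the way spans are selected for masking is left to the caller because the abstract does not specify it.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RampInstance:
    """Hypothetical container for one masked-prediction example."""
    masked_text: str     # passage with selected spans replaced by "[mask]"
    answers: list[str]   # the hidden spans, in order of appearance


def make_ramp_instance(text: str, spans: list[str]) -> RampInstance:
    """Replace the first occurrence of each span with "[mask]" (assumed token)."""
    # Locate each span in the passage; spans that do not occur are ignored.
    found = sorted((text.find(s), s) for s in spans if s in text)
    pieces, answers, cursor = [], [], 0
    for start, span in found:
        if start < cursor:
            # Overlaps a span that was already masked; skip it.
            continue
        pieces.append(text[cursor:start])
        pieces.append("[mask]")
        answers.append(span)
        cursor = start + len(span)
    pieces.append(text[cursor:])
    return RampInstance("".join(pieces), answers)


def curriculum_order(instances: list[RampInstance]) -> list[RampInstance]:
    """Order instances from fewest to most masked spans (easy to hard)."""
    return sorted(instances, key=lambda inst: len(inst.answers))


if __name__ == "__main__":
    passage = "Alan Turing was born in London in 1912."
    hard = make_ramp_instance(passage, ["London", "1912"])
    easy = make_ramp_instance(passage, ["1912"])
    for inst in curriculum_order([hard, easy]):
        print(len(inst.answers), inst.masked_text)
```

In this sketch the search agent would receive `masked_text` and be expected to recover `answers` using a search tool; sorting by the number of masks is one simple way to realize an easy-to-hard schedule.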
