MaskSearch: A Universal Pre-Training Framework to Enhance Agentic Search Capability

May 26, 2025
作者: Weiqi Wu, Xin Guan, Shen Huang, Yong Jiang, Pengjun Xie, Fei Huang, Jiuxin Cao, Hai Zhao, Jingren Zhou
cs.AI

Abstract

Retrieval-Augmented Language Models (RALMs) represent a classic paradigm in which models enhance their generative capabilities using external knowledge retrieved via a specialized module. Recent advancements in agent techniques enable Large Language Models (LLMs) to autonomously utilize tools for retrieval, planning, and reasoning. While existing training-based methods show promise, their agentic abilities are limited by inherent characteristics of the task-specific data used during training. To further enhance the universal search capability of agents, we propose a novel pre-training framework, MaskSearch. In the pre-training stage, we introduce the Retrieval-Augmented Mask Prediction (RAMP) task, in which the model learns to leverage search tools to fill masked spans across a large corpus of pre-training data, thereby equipping LLMs with universal retrieval and reasoning capabilities. The model is then trained on downstream tasks for further improvement. We apply both Supervised Fine-tuning (SFT) and Reinforcement Learning (RL) for training. For SFT, we combine agent-based and distillation-based methods to generate training data, starting with a multi-agent system consisting of a planner, a rewriter, and an observer, followed by a self-evolving teacher model. For RL, we employ DAPO as the training framework and adopt a hybrid reward system consisting of answer rewards and format rewards. Additionally, we introduce a curriculum learning approach that allows the model to learn progressively from easier to more challenging instances based on the number of masked spans. We evaluate the effectiveness of our framework on open-domain multi-hop question answering. Through extensive experiments, we demonstrate that MaskSearch significantly enhances the performance of LLM-based search agents on both in-domain and out-of-domain downstream tasks.
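
Below is a minimal, illustrative Python sketch of the mechanics the abstract describes: building a RAMP instance by masking salient spans in a passage, ordering instances by mask count for curriculum learning, and combining an answer reward with a format reward. All function names, the [MASK] token, the <answer> tag format, and the reward weights are hypothetical stand-ins, not the paper's actual implementation.

```python
import re

MASK_TOKEN = "[MASK]"

def build_ramp_example(text: str, spans: list[str]) -> dict:
    """Build one Retrieval-Augmented Mask Prediction (RAMP) instance by
    masking the given salient spans in a pre-training passage."""
    masked = text
    for span in spans:
        masked = masked.replace(span, MASK_TOKEN, 1)
    return {"input": masked, "targets": spans, "num_masks": len(spans)}

def curriculum_order(examples: list[dict]) -> list[dict]:
    """Order instances from easy to hard, using the number of masked
    spans as the difficulty signal (the abstract's curriculum)."""
    return sorted(examples, key=lambda ex: ex["num_masks"])

def hybrid_reward(response: str, gold_spans: list[str],
                  w_answer: float = 0.9, w_format: float = 0.1) -> float:
    """Hybrid RL reward: an answer reward (fraction of masked spans
    recovered) plus a format reward (output wrapped in <answer> tags).
    The paper's actual scoring and weights may differ."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    format_ok = 1.0 if m else 0.0
    pred = m.group(1) if m else response
    hits = sum(span.lower() in pred.lower() for span in gold_spans)
    return w_answer * hits / max(len(gold_spans), 1) + w_format * format_ok

if __name__ == "__main__":
    ex = build_ramp_example(
        "Marie Curie won the Nobel Prize in Physics in 1903.",
        ["Marie Curie", "1903"])
    print(ex["input"])  # [MASK] won the Nobel Prize in Physics in [MASK].
    print(hybrid_reward("<answer>Marie Curie, 1903</answer>",
                        ex["targets"]))  # 1.0
```

Because RAMP instances are derived from generic pre-training text rather than a specific QA dataset, this style of self-supervised construction is what lets the framework scale to "a large corpus of pre-training data" as the abstract claims.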