
MaskSearch: A Universal Pre-Training Framework to Enhance Agentic Search Capability

May 26, 2025
作者: Weiqi Wu, Xin Guan, Shen Huang, Yong Jiang, Pengjun Xie, Fei Huang, Jiuxin Cao, Hai Zhao, Jingren Zhou
cs.AI

Abstract

Retrieval-Augmented Language Models (RALMs) represent a classic paradigm where models enhance generative capabilities using external knowledge retrieved via a specialized module. Recent advancements in Agent techniques enable Large Language Models (LLMs) to autonomously utilize tools for retrieval, planning, and reasoning. While existing training-based methods show promise, their agentic abilities are limited by inherent characteristics of the task-specific data used during training. To further enhance the universal search capability of agents, we propose a novel pre-training framework, MaskSearch. In the pre-training stage, we introduce the Retrieval-Augmented Mask Prediction (RAMP) task, where the model learns to leverage search tools to fill masked spans over a large corpus of pre-training data, thereby equipping LLMs with universal retrieval and reasoning capabilities. Afterwards, the model is trained on downstream tasks to achieve further improvement. We apply both Supervised Fine-tuning (SFT) and Reinforcement Learning (RL) for training. For SFT, we combine agent-based and distillation-based methods to generate training data, starting with a multi-agent system consisting of a planner, a rewriter, and an observer, followed by a self-evolving teacher model. For RL, we employ DAPO as the training framework and adopt a hybrid reward system consisting of answer rewards and format rewards. Additionally, we introduce a curriculum learning approach that allows the model to learn progressively from easier to more challenging instances based on the number of masked spans. We evaluate the effectiveness of our framework in the scenario of open-domain multi-hop question answering. Through extensive experiments, we demonstrate that MaskSearch significantly enhances the performance of LLM-based search agents on both in-domain and out-of-domain downstream tasks.
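The three mechanisms named above (RAMP-style masked-span construction, curriculum ordering by mask count, and the hybrid answer/format reward) can be sketched roughly as follows. This is an illustrative reconstruction based only on the abstract, not the authors' implementation: the names `mask_spans`, `build_curriculum`, `hybrid_reward`, the `[mask]` token, and the 0.9/0.1 reward weighting are all assumptions.

```python
# Illustrative sketch of the ideas in the abstract -- NOT the paper's code.
# All function names, the mask token, and the reward weights are hypothetical.

MASK_TOKEN = "[mask]"

def mask_spans(text: str, spans: list[str]) -> tuple[str, list[str]]:
    """RAMP-style example construction: replace salient spans (entities,
    dates, etc.) with a mask token. During training, the model must recover
    the masked spans and may issue search-tool calls to fetch evidence."""
    masked, answers = text, []
    for span in spans:
        if span in masked:
            masked = masked.replace(span, MASK_TOKEN, 1)
            answers.append(span)
    return masked, answers

def build_curriculum(examples: list[tuple[str, list[str]]]) -> list[tuple[str, list[str]]]:
    """Curriculum learning as described in the abstract: order examples
    from easy to hard by the number of masked spans each one contains."""
    masked = [mask_spans(text, spans) for text, spans in examples]
    return sorted(masked, key=lambda ex: ex[0].count(MASK_TOKEN))

def hybrid_reward(pred: str, gold: str, answer_weight: float = 0.9) -> float:
    """Toy stand-in for the hybrid RL reward (answer reward + format reward).
    The exact-match check, the trivial format check, and the 0.9/0.1 mix
    are placeholders, not the paper's actual reward functions."""
    answer_r = 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0
    format_r = 1.0 if pred.strip().endswith(".") else 0.0  # placeholder format check
    return answer_weight * answer_r + (1.0 - answer_weight) * format_r
```

For example, given one passage with a single salient span and another with three, `build_curriculum` would schedule the single-mask example first; in the real system, the model would interleave retrieval calls while predicting each masked span.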
