MaskSearch: A Universal Pre-Training Framework to Enhance Agentic Search Capability

Abstract

Retrieval-Augmented Language Models (RALMs) represent a classic paradigm in which a model enhances its generative capabilities with external knowledge retrieved by a specialized module. Recent advances in agent techniques enable Large Language Models (LLMs) to autonomously use tools for retrieval, planning, and reasoning. While existing training-based methods show promise, their agentic abilities are limited by the inherent characteristics of the task-specific data used during training. To further enhance the universal search capability of agents, we propose a novel pre-training framework, MaskSearch. In the pre-training stage, we introduce the Retrieval Augmented Mask Prediction (RAMP) task, in which the model learns to leverage search tools to fill masked spans over a large amount of pre-training data, thereby acquiring universal retrieval and reasoning capabilities. The model is then trained on downstream tasks to achieve further improvement. We apply both Supervised Fine-tuning (SFT) and Reinforcement Learning (RL) for training. For SFT, we combine agent-based and distillation-based methods to generate training data, starting with a multi-agent system consisting of a planner, a rewriter, and an observer, followed by a self-evolving teacher model. For RL, we employ DAPO as the training framework and adopt a hybrid reward system consisting of answer rewards and format rewards. Additionally, we introduce a curriculum learning approach that allows the model to learn progressively from easier to more challenging instances, based on the number of masked spans. We evaluate the effectiveness of our framework in the scenario of open-domain multi-hop question answering. Through extensive experiments, we demonstrate that MaskSearch significantly enhances the performance of LLM-based search agents on both in-domain and out-of-domain downstream tasks.
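The abstract names three mechanisms that are easy to picture in code: masking spans to build RAMP instances, ordering instances by the number of masked spans for curriculum learning, and a hybrid reward that combines an answer term with a format term. The following is only a toy sketch of those ideas, not the authors' implementation. The `[mask]` placeholder, the `<answer>` tag convention, the substring-based answer score, and the reward weights are all assumptions introduced here for illustration; the abstract does not specify any of them.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

MASK = "[mask]"  # hypothetical placeholder token, not taken from the paper
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)  # assumed output format


@dataclass
class RampInstance:
    text: str           # passage with some spans replaced by MASK
    answers: list[str]  # the original spans the model should recover


def make_instance(passage: str, spans: list[str]) -> RampInstance:
    """Replace the first occurrence of each span with the mask placeholder."""
    masked = passage
    for span in spans:
        masked = masked.replace(span, MASK, 1)
    return RampInstance(masked, list(spans))


def curriculum(instances: list[RampInstance]) -> list[RampInstance]:
    """Order instances from fewer to more masked spans (easy to hard)."""
    return sorted(instances, key=lambda inst: len(inst.answers))


def format_reward(response: str) -> float:
    """1.0 if the response contains exactly one <answer>...</answer> block."""
    return 1.0 if len(ANSWER_RE.findall(response)) == 1 else 0.0


def answer_reward(response: str, gold: list[str]) -> float:
    """Fraction of gold spans found in the answer block (an assumed metric)."""
    match = ANSWER_RE.search(response)
    if match is None or not gold:
        return 0.0
    predicted = match.group(1).lower()
    hits = sum(1 for g in gold if g.lower() in predicted)
    return hits / len(gold)


def hybrid_reward(response: str, gold: list[str],
                  w_answer: float = 0.9, w_format: float = 0.1) -> float:
    """Weighted sum of answer and format rewards; the weights are arbitrary."""
    return (w_answer * answer_reward(response, gold)
            + w_format * format_reward(response))


if __name__ == "__main__":
    hard = make_instance("Alan Turing was born in London in 1912.",
                         ["London", "1912"])
    easy = make_instance("Paris is the capital of France.", ["Paris"])
    for inst in curriculum([hard, easy]):
        print(len(inst.answers), inst.text)
    print(hybrid_reward("<answer>London, 1912</answer>", hard.answers))
```

In this sketch a response with the right format but a wrong answer still earns the small format term, which is one common motivation for splitting the reward into two parts.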
