ZeroSearch: Incentivize the Search Capability of LLMs without Searching
May 7, 2025
Authors: Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Fei Huang, Yan Zhang
cs.AI
Abstract
Effective information searching is essential for enhancing the reasoning and
generation capabilities of large language models (LLMs). Recent research has
explored using reinforcement learning (RL) to improve LLMs' search capabilities
by interacting with live search engines in real-world environments. While these
approaches show promising results, they face two major challenges: (1)
Uncontrolled Document Quality: The quality of documents returned by search
engines is often unpredictable, introducing noise and instability into the
training process. (2) Prohibitively High API Costs: RL training requires
frequent rollouts, potentially involving hundreds of thousands of search
requests, which incur substantial API expenses and severely constrain
scalability. To address these challenges, we introduce ZeroSearch, a
reinforcement learning framework that incentivizes the search capabilities of
LLMs without interacting with real search engines. Our approach begins with
lightweight supervised fine-tuning to transform the LLM into a retrieval module
capable of generating both relevant and noisy documents in response to a query.
During RL training, we employ a curriculum-based rollout strategy that
incrementally degrades the quality of generated documents, progressively
eliciting the model's reasoning ability by exposing it to increasingly
challenging retrieval scenarios. Extensive experiments demonstrate that
ZeroSearch effectively incentivizes the search capabilities of LLMs using a 3B
LLM as the retrieval module. Remarkably, a 7B retrieval module achieves
comparable performance to the real search engine, while a 14B retrieval module
even surpasses it. Furthermore, it generalizes well across both base and
instruction-tuned models of various parameter sizes and is compatible with a
wide range of RL algorithms.
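The abstract describes a curriculum-based rollout strategy that incrementally degrades the quality of the simulated retrieval documents, but does not give the schedule itself. The sketch below illustrates one plausible realization under an assumed linear noise-mixing schedule; the function names, parameters, and the choice of a linear ramp are all hypothetical, not taken from the paper.

```python
import random


def noisy_doc_probability(step, total_steps, p_start=0.0, p_end=1.0):
    """Assumed linear curriculum: the fraction of noisy documents grows
    from p_start to p_end as training progresses."""
    frac = min(step / total_steps, 1.0)
    return p_start + (p_end - p_start) * frac


def simulate_rollout(step, total_steps, num_docs=5, seed=0):
    """Label the documents produced for one rollout at a given training
    step as either 'relevant' or 'noisy', according to the curriculum."""
    rng = random.Random(seed)
    p_noisy = noisy_doc_probability(step, total_steps)
    return ["noisy" if rng.random() < p_noisy else "relevant"
            for _ in range(num_docs)]


# Early in training the policy sees mostly relevant documents;
# late in training it sees mostly noisy ones.
early = simulate_rollout(step=0, total_steps=100)
late = simulate_rollout(step=100, total_steps=100)
```

In this sketch, "noisy" versus "relevant" would correspond to the two document styles the fine-tuned simulation LLM is trained to generate; the RL policy is simply exposed to a progressively harsher mixture.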