ZeroSearch: Incentivize the Search Capability of LLMs without Searching
May 7, 2025
Authors: Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Fei Huang, Yan Zhang
cs.AI
Abstract
Effective information searching is essential for enhancing the reasoning and
generation capabilities of large language models (LLMs). Recent research has
explored using reinforcement learning (RL) to improve LLMs' search capabilities
by interacting with live search engines in real-world environments. While these
approaches show promising results, they face two major challenges: (1)
Uncontrolled Document Quality: The quality of documents returned by search
engines is often unpredictable, introducing noise and instability into the
training process. (2) Prohibitively High API Costs: RL training requires
frequent rollouts, potentially involving hundreds of thousands of search
requests, which incur substantial API expenses and severely constrain
scalability. To address these challenges, we introduce ZeroSearch, a
reinforcement learning framework that incentivizes the search capabilities of
LLMs without interacting with real search engines. Our approach begins with
lightweight supervised fine-tuning to transform the LLM into a retrieval module
capable of generating both relevant and noisy documents in response to a query.
During RL training, we employ a curriculum-based rollout strategy that
incrementally degrades the quality of generated documents, progressively
eliciting the model's reasoning ability by exposing it to increasingly
challenging retrieval scenarios. Extensive experiments demonstrate that
ZeroSearch effectively incentivizes the search capabilities of LLMs using a 3B
LLM as the retrieval module. Remarkably, a 7B retrieval module achieves
comparable performance to the real search engine, while a 14B retrieval module
even surpasses it. Furthermore, it generalizes well across both base and
instruction-tuned models of various parameter sizes and is compatible with a
wide range of RL algorithms.
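The curriculum-based rollout strategy described above can be illustrated with a minimal sketch: the simulated retrieval module returns either a relevant or a noisy document, and the probability of noise is annealed upward over training so the policy faces increasingly challenging retrieval scenarios. The function names, the linear schedule, and the 0.8 ceiling are illustrative assumptions, not the paper's exact recipe.

```python
import random

# Assumed linear curriculum: the chance that the simulated retriever
# returns a noisy document grows from p_start to p_end over training.
def noise_probability(step: int, total_steps: int,
                      p_start: float = 0.0, p_end: float = 0.8) -> float:
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + (p_end - p_start) * frac

# Stand-in for the fine-tuned LLM retrieval module, which the paper
# trains to generate both relevant and noisy documents for a query.
def simulate_retrieval(query: str, step: int, total_steps: int,
                       rng: random.Random) -> str:
    if rng.random() < noise_probability(step, total_steps):
        return f"[noisy document for: {query}]"
    return f"[relevant document for: {query}]"

if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 500, 1000):
        print(step, round(noise_probability(step, 1000), 2))
```

Because the "search engine" is itself an LLM rather than a live API, document quality becomes a controllable knob and rollouts incur no per-request cost, which is what makes this curriculum feasible at RL scale.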