ZeroSearch: Incentivize the Search Capability of LLMs without Searching

Abstract

Effective information searching is essential for enhancing the reasoning and generation capabilities of large language models (LLMs). Recent research has explored using reinforcement learning (RL) to improve LLMs' search capabilities by interacting with live search engines in real-world environments. While these approaches show promising results, they face two major challenges: (1) Uncontrolled Document Quality: The quality of documents returned by search engines is often unpredictable, introducing noise and instability into the training process. (2) Prohibitively High API Costs: RL training requires frequent rollouts, potentially involving hundreds of thousands of search requests, which incur substantial API expenses and severely constrain scalability. To address these challenges, we introduce ZeroSearch, a reinforcement learning framework that incentivizes the search capabilities of LLMs without interacting with real search engines. Our approach begins with lightweight supervised fine-tuning to transform the LLM into a retrieval module capable of generating both relevant and noisy documents in response to a query. During RL training, we employ a curriculum-based rollout strategy that incrementally degrades the quality of generated documents, progressively eliciting the model's reasoning ability by exposing it to increasingly challenging retrieval scenarios. Extensive experiments demonstrate that ZeroSearch effectively incentivizes the search capabilities of LLMs using a 3B LLM as the retrieval module. Remarkably, a 7B retrieval module achieves performance comparable to that of the real search engine, while a 14B retrieval module even surpasses it. Furthermore, ZeroSearch generalizes well across both base and instruction-tuned models of various parameter sizes and is compatible with a wide range of RL algorithms.
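
The curriculum-based rollout described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the schedule, so everything specific here is an assumption made for illustration: the exponential ramp, the parameter names (p_start, p_end, base), their default values, and the helper simulated_search, which only decides which kind of document the fine-tuned retrieval module would be asked to generate and does not call any model.

```python
import random


def noise_probability(step, total_steps, p_start=0.0, p_end=0.5, base=4.0):
    """Probability of requesting a noisy document at a given training step.

    Assumed schedule (not from the abstract): an exponential ramp from
    p_start to p_end, so quality degrades slowly at first and faster later.
    base must be greater than 1.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    ramp = (base ** frac - 1.0) / (base - 1.0)     # 0 at the start, 1 at the end
    return p_start + ramp * (p_end - p_start)


def simulated_search(query, step, total_steps, rng):
    """Hypothetical stand-in for the retrieval module.

    A real setup would prompt the fine-tuned LLM to write either a relevant
    or a noisy document; here we only record which kind would be requested.
    """
    p_noisy = noise_probability(step, total_steps)
    kind = "noisy" if rng.random() < p_noisy else "relevant"
    return {"query": query, "kind": kind, "p_noisy": p_noisy}


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 50, 100):
        print(step, simulated_search("who wrote Hamlet?", step, 100, rng))
```

Under these assumptions, early rollouts see only relevant documents, and the share of noisy documents rises toward p_end as training proceeds, which is one way to expose the policy model to increasingly challenging retrieval scenarios.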
