ZeroSearch: Incentivize the Search Capability of LLMs without Searching

Abstract

Effective information searching is essential for enhancing the reasoning and generation capabilities of large language models (LLMs). Recent research has explored using reinforcement learning (RL) to improve LLMs' search capabilities through interaction with live search engines in real-world environments. While these approaches show promising results, they face two major challenges: (1) Uncontrolled document quality: the quality of documents returned by search engines is often unpredictable, introducing noise and instability into the training process. (2) Prohibitively high API costs: RL training requires frequent rollouts, potentially involving hundreds of thousands of search requests, which incur substantial API expenses and severely constrain scalability. To address these challenges, we introduce ZeroSearch, a reinforcement learning framework that incentivizes the search capabilities of LLMs without interacting with real search engines. Our approach begins with lightweight supervised fine-tuning that transforms an LLM into a retrieval module capable of generating both relevant and noisy documents in response to a query. During RL training, we employ a curriculum-based rollout strategy that incrementally degrades the quality of generated documents, progressively eliciting the model's reasoning ability by exposing it to increasingly challenging retrieval scenarios. Extensive experiments demonstrate that ZeroSearch effectively incentivizes the search capabilities of LLMs using a 3B LLM as the retrieval module. Remarkably, a 7B retrieval module achieves performance comparable to the real search engine, while a 14B retrieval module even surpasses it. Furthermore, ZeroSearch generalizes well across both base and instruction-tuned models of various parameter sizes and is compatible with a wide range of RL algorithms.
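The curriculum-based rollout described above can be illustrated with a small sketch. The abstract does not specify the schedule, so everything here is an assumption made for illustration: the function names `noise_probability` and `choose_document_mode`, the parameters `p_start`, `p_end`, and `base`, and the exponential shape of the curve are hypothetical and should not be read as the authors' implementation. The sketch only shows one way the share of noisy documents could grow as training proceeds.

```python
import random


def noise_probability(step: int, total_steps: int,
                      p_start: float = 0.0, p_end: float = 0.75,
                      base: float = 4.0) -> float:
    """Probability of asking the simulated retriever for a noisy document.

    Hypothetical schedule: rises from p_start to p_end over training,
    slowly at first and faster later (exponential in the training fraction).
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp the training fraction to [0, 1] so out-of-range steps are safe.
    frac = min(max(step / total_steps, 0.0), 1.0)
    if base == 1.0:
        growth = frac  # base 1 degenerates to a linear schedule
    else:
        growth = (base ** frac - 1.0) / (base - 1.0)
    return p_start + growth * (p_end - p_start)


def choose_document_mode(step: int, total_steps: int,
                         rng: random.Random) -> str:
    """Sample whether the next generated document is 'relevant' or 'noisy'."""
    if rng.random() < noise_probability(step, total_steps):
        return "noisy"
    return "relevant"


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 25, 50, 75, 100):
        p = noise_probability(step, 100)
        mode = choose_document_mode(step, 100, rng)
        print(f"step={step:3d}  p_noisy={p:.3f}  sampled={mode}")
```

In an actual training loop, the sampled mode would select the prompt given to the fine-tuned retrieval LLM, so that early rollouts see mostly relevant documents and later rollouts see progressively more noise.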
