Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Abstract

Efficiently acquiring external knowledge and up-to-date information is essential for effective reasoning and text generation in large language models (LLMs). Retrieval-augmentation and tool-use training approaches that treat a search engine as a tool either lack the flexibility needed for complex multi-turn retrieval or require large-scale supervised data. Prompting advanced LLMs with reasoning capabilities to use a search engine at inference time is suboptimal, because the LLM never learns how to interact with the search engine effectively. This paper introduces Search-R1, an extension of the DeepSeek-R1 model in which the LLM learns, solely through reinforcement learning (RL), to autonomously generate (multiple) search queries during step-by-step reasoning with real-time retrieval. Search-R1 optimizes LLM rollouts with multi-turn search interactions, using retrieved-token masking for stable RL training and a simple outcome-based reward function. Experiments on seven question-answering datasets show that Search-R1 improves performance by 26% (Qwen2.5-7B), 21% (Qwen2.5-3B), and 10% (LLaMA3.2-3B) over state-of-the-art baselines. The paper further provides empirical insights into RL optimization methods, LLM choices, and response-length dynamics in retrieval-augmented reasoning. The code and model checkpoints are available at https://github.com/PeterGriffinJin/Search-R1.
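To make the three ingredients named in the abstract more concrete (multi-turn search rollouts, retrieved-token masking, and an outcome-based reward), the following is a minimal illustrative sketch, not the authors' implementation. The `<search>` markers, the whitespace tokenizer, the exact-match reward, and every function name are assumptions introduced here for illustration; the abstract does not specify any of them.

```python
from typing import Callable, List, Tuple

# Hypothetical query markers; the abstract does not define the actual format.
SEARCH_OPEN, SEARCH_CLOSE = "<search>", "</search>"

Segment = Tuple[str, bool]  # (text, is_retrieved)


def rollout(generate: Callable[[str], str],
            search: Callable[[str], str],
            prompt: str,
            max_turns: int = 4) -> List[Segment]:
    """Interleave model generation with retrieval for up to max_turns turns."""
    segments: List[Segment] = []
    context = prompt
    for _ in range(max_turns):
        chunk = generate(context)              # text written by the policy
        segments.append((chunk, False))
        context += chunk
        if SEARCH_OPEN not in chunk or SEARCH_CLOSE not in chunk:
            break                              # no query issued, rollout ends
        query = chunk.split(SEARCH_OPEN, 1)[1].split(SEARCH_CLOSE, 1)[0].strip()
        passage = search(query)                # text returned by the search engine
        segments.append((passage, True))
        context += passage
    return segments


def loss_mask(segments: List[Segment],
              tokenize: Callable[[str], List[str]] = str.split) -> List[int]:
    """Return 1 for policy-generated tokens and 0 for retrieved tokens."""
    mask: List[int] = []
    for text, is_retrieved in segments:
        mask.extend([0 if is_retrieved else 1] * len(tokenize(text)))
    return mask


def outcome_reward(prediction: str, gold: str) -> float:
    """Assumed outcome reward: 1.0 on a normalized exact match, else 0.0."""
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())
    return 1.0 if normalize(prediction) == normalize(gold) else 0.0
```

In this sketch, the mask would multiply the per-token RL loss so that only tokens the policy actually produced contribute to the gradient, while retrieved passages still appear in the context. The reward depends only on the final answer, not on intermediate queries.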

