
SSRL: Self-Search Reinforcement Learning

August 14, 2025
作者: Yuchen Fan, Kaiyan Zhang, Heng Zhou, Yuxin Zuo, Yanxu Chen, Yu Fu, Xinwei Long, Xuekai Zhu, Che Jiang, Yuchen Zhang, Li Kang, Gang Chen, Cheng Huang, Zhizhou He, Bingning Wang, Lei Bai, Ning Ding, Bowen Zhou
cs.AI

Abstract

We investigate the potential of large language models (LLMs) to serve as efficient simulators for agentic search tasks in reinforcement learning (RL), thereby reducing dependence on costly interactions with external search engines. To this end, we first quantify the intrinsic search capability of LLMs via structured prompting and repeated sampling, which we term Self-Search. Our results reveal that LLMs exhibit strong scaling behavior with respect to the inference budget, achieving high pass@k on question-answering benchmarks, including the challenging BrowseComp task. Building on these observations, we introduce Self-Search RL (SSRL), which enhances LLMs' Self-Search capability through format-based and rule-based rewards. SSRL enables models to iteratively refine their knowledge utilization internally, without requiring access to external tools. Empirical evaluations demonstrate that SSRL-trained policy models provide a cost-effective and stable environment for search-driven RL training, reducing reliance on external search engines and facilitating robust sim-to-real transfer. We draw the following conclusions: 1) LLMs possess world knowledge that can be effectively elicited to achieve high performance; 2) SSRL demonstrates the potential of leveraging internal knowledge to reduce hallucination; 3) SSRL-trained models integrate seamlessly with external search engines without additional effort. Our findings highlight the potential of LLMs to support more scalable RL agent training.
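The pass@k figures reported above are typically computed with the standard unbiased estimator used for repeated-sampling evaluation (given n generations per question, of which c are correct). The sketch below is illustrative only, not the authors' code:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k.

    Probability that at least one of k samples drawn without
    replacement from n generations (c of them correct) is correct:
    1 - C(n - c, k) / C(n, k).
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # must include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 sampled answers, 4 correct, evaluated at k = 8
print(round(pass_at_k(16, 4, 8), 3))  # → 0.962
```

Averaging this quantity over a benchmark's questions gives the pass@k curve as a function of the inference budget k.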