
SSRL: Self-Search Reinforcement Learning

August 14, 2025
Authors: Yuchen Fan, Kaiyan Zhang, Heng Zhou, Yuxin Zuo, Yanxu Chen, Yu Fu, Xinwei Long, Xuekai Zhu, Che Jiang, Yuchen Zhang, Li Kang, Gang Chen, Cheng Huang, Zhizhou He, Bingning Wang, Lei Bai, Ning Ding, Bowen Zhou
cs.AI

Abstract

We investigate the potential of large language models (LLMs) to serve as efficient simulators for agentic search tasks in reinforcement learning (RL), thereby reducing dependence on costly interactions with external search engines. To this end, we first quantify the intrinsic search capability of LLMs via structured prompting and repeated sampling, which we term Self-Search. Our results reveal that LLMs exhibit strong scaling behavior with respect to the inference budget, achieving high pass@k on question-answering benchmarks, including the challenging BrowseComp task. Building on these observations, we introduce Self-Search RL (SSRL), which enhances LLMs' Self-Search capability through format-based and rule-based rewards. SSRL enables models to iteratively refine their knowledge utilization internally, without requiring access to external tools. Empirical evaluations demonstrate that SSRL-trained policy models provide a cost-effective and stable environment for search-driven RL training, reducing reliance on external search engines and facilitating robust sim-to-real transfer. We draw the following conclusions: 1) LLMs possess world knowledge that can be effectively elicited to achieve high performance; 2) SSRL demonstrates the potential of leveraging internal knowledge to reduce hallucination; 3) SSRL-trained models integrate seamlessly with external search engines without additional effort. Our findings highlight the potential of LLMs to support more scalable RL agent training.
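The pass@k figures cited above are typically computed with the unbiased estimator of Chen et al. (2021) over repeated samples. A minimal sketch of that estimator follows; the function name and the example sample counts are illustrative, not taken from the paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k completions
    drawn without replacement from n sampled completions is correct,
    given that c of the n are correct (Chen et al., 2021)."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k slots
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 4 correct answers among 64 sampled completions.
print(pass_at_k(n=64, c=4, k=8))  # ~0.42
```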
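The abstract describes SSRL's training signal as a combination of format-based and rule-based rewards. The sketch below shows one plausible way such a reward could be wired up, assuming answers are enclosed in `<answer>` tags and scored by exact match; the tag names, weights, and function name are assumptions, and the paper's actual reward design may differ:

```python
import re

def self_search_reward(completion: str, gold_answer: str) -> float:
    """Hypothetical format + rule-based reward: the completion must
    follow the expected format (an <answer>...</answer> span), and the
    extracted answer is checked against the gold answer by exact match.
    Tag names and reward values are illustrative assumptions."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0  # format reward: malformed outputs earn nothing
    predicted = match.group(1).strip().lower()
    # Rule-based reward: full credit for a correct answer,
    # small format credit for a well-formed but incorrect one.
    return 1.0 if predicted == gold_answer.strip().lower() else 0.1

print(self_search_reward("<answer>Paris</answer>", "Paris"))   # 1.0
print(self_search_reward("<answer>London</answer>", "Paris"))  # 0.1
print(self_search_reward("Paris", "Paris"))                    # 0.0
```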