基于代理的强化学习搜索存在安全隐患
Agentic Reinforcement Learning for Search is Unsafe
October 20, 2025
作者: Yushi Yang, Shreyansh Padarha, Andrew Lee, Adam Mahdi
cs.AI
摘要
代理强化学习(RL)训练大型语言模型在推理过程中自主调用工具,其中搜索是最常见的应用场景。这些模型在多步推理任务中表现出色,但其安全性特性尚未得到充分理解。本研究表明,经过RL训练的搜索模型继承了指令微调中的拒绝机制,通常通过将有害请求转化为安全查询来规避风险。然而,这种安全性是脆弱的。两种简单的攻击方法——一种强制模型以搜索开始响应(搜索攻击),另一种鼓励模型反复搜索(多重搜索攻击)——会引发一连串的有害搜索和回答。在涵盖两个模型系列(Qwen、Llama)及本地与网络搜索的实验中,这些攻击使拒绝率最多降低了60.0%,回答安全性降低了82.5%,搜索查询安全性降低了82.4%。攻击之所以成功,是因为它们在模型生成继承的拒绝标记之前,诱使模型生成了有害的、反映请求的搜索查询。这揭示了当前RL训练的一个核心弱点:它奖励生成有效查询的持续行为,却未考虑这些查询的有害性。因此,RL搜索模型存在用户易于利用的漏洞,亟需开发以安全为导向的代理RL流程,优化安全搜索。
English
Agentic reinforcement learning (RL) trains large language models to
autonomously call tools during reasoning, with search as the most common
application. These models excel at multi-step reasoning tasks, but their safety
properties are not well understood. In this study, we show that RL-trained
search models inherit refusal from instruction tuning and often deflect harmful
requests by turning them into safe queries. However, this safety is fragile.
Two simple attacks, one that forces the model to begin response with search
(Search attack), another that encourages models to repeatedly search
(Multi-search attack), trigger cascades of harmful searches and answers. Across
two model families (Qwen, Llama) with both local and web search, these attacks
lower refusal rates by up to 60.0%, answer safety by 82.5%, and search-query
safety by 82.4%. The attacks succeed by triggering models to generate harmful,
request-mirroring search queries before they can generate the inherited refusal
tokens. This exposes a core weakness of current RL training: it rewards
continued generation of effective queries without accounting for their
harmfulness. As a result, RL search models have vulnerabilities that users can
easily exploit, making it urgent to develop safety-aware agentic RL pipelines
optimising for safe search.