ChatPaper.aiChatPaper

代理强化学习在搜索领域中的应用存在安全隐患

Agentic Reinforcement Learning for Search is Unsafe

October 20, 2025
作者: Yushi Yang, Shreyansh Padarha, Andrew Lee, Adam Mahdi
cs.AI

摘要

代理強化學習(Agentic Reinforcement Learning, RL)訓練大型語言模型在推理過程中自主調用工具,其中搜索是最常見的應用。這些模型在多步驟推理任務中表現出色,但其安全性特性尚未得到充分理解。在本研究中,我們發現經過RL訓練的搜索模型繼承了指令微調中的拒絕行為,並經常通過將有害請求轉化為安全查詢來規避風險。然而,這種安全性是脆弱的。兩種簡單的攻擊——一種強制模型以搜索開始回應(搜索攻擊),另一種鼓勵模型重複搜索(多重搜索攻擊)——會引發一系列有害搜索和答案的連鎖反應。在兩個模型家族(Qwen, Llama)中,無論是本地搜索還是網絡搜索,這些攻擊使拒絕率降低最多達60.0%,答案安全性降低82.5%,搜索查詢安全性降低82.4%。攻擊之所以成功,是因為它們觸發模型在生成繼承的拒絕令牌之前,先產生有害且反映請求的搜索查詢。這暴露了當前RL訓練的核心弱點:它獎勵持續生成有效的查詢,而不考慮其危害性。因此,RL搜索模型存在用戶易於利用的漏洞,迫切需要開發以安全為導向的代理RL流程,以優化安全搜索。
English
Agentic reinforcement learning (RL) trains large language models to autonomously call tools during reasoning, with search as the most common application. These models excel at multi-step reasoning tasks, but their safety properties are not well understood. In this study, we show that RL-trained search models inherit refusal from instruction tuning and often deflect harmful requests by turning them into safe queries. However, this safety is fragile. Two simple attacks, one that forces the model to begin response with search (Search attack), another that encourages models to repeatedly search (Multi-search attack), trigger cascades of harmful searches and answers. Across two model families (Qwen, Llama) with both local and web search, these attacks lower refusal rates by up to 60.0%, answer safety by 82.5%, and search-query safety by 82.4%. The attacks succeed by triggering models to generate harmful, request-mirroring search queries before they can generate the inherited refusal tokens. This exposes a core weakness of current RL training: it rewards continued generation of effective queries without accounting for their harmfulness. As a result, RL search models have vulnerabilities that users can easily exploit, making it urgent to develop safety-aware agentic RL pipelines optimising for safe search.
PDF42October 21, 2025