

Deep Research Brings Deeper Harm

October 13, 2025
Authors: Shuo Chen, Zonggen Li, Zhen Han, Bailan He, Tong Liu, Haokun Chen, Georg Groh, Philip Torr, Volker Tresp, Jindong Gu
cs.AI

Abstract

Deep Research (DR) agents built on Large Language Models (LLMs) can perform complex, multi-step research by decomposing tasks, retrieving online information, and synthesizing detailed reports. However, the misuse of such powerful capabilities can lead to even greater risks. This is especially concerning in high-stakes and knowledge-intensive domains such as biosecurity, where a DR agent can generate a professional report containing detailed forbidden knowledge. Unfortunately, we have found such risks in practice: simply submitting a harmful query, which a standalone LLM directly rejects, can elicit a detailed and dangerous report from DR agents. This highlights the elevated risks and underscores the need for deeper safety analysis. Yet jailbreak methods designed for LLMs fall short in exposing such unique risks, as they do not target the research ability of DR agents. To address this gap, we propose two novel jailbreak strategies: Plan Injection, which injects malicious sub-goals into the agent's plan; and Intent Hijack, which reframes harmful queries as academic research questions. We conducted extensive experiments across different LLMs and various safety benchmarks, including general and biosecurity forbidden prompts. These experiments reveal three key findings: (1) alignment of LLMs often fails in DR agents, where harmful prompts framed in academic terms can hijack agent intent; (2) multi-step planning and execution weaken alignment, revealing systemic vulnerabilities that prompt-level safeguards cannot address; (3) DR agents not only bypass refusals but also produce more coherent, professional, and dangerous content than standalone LLMs. These results demonstrate a fundamental misalignment in DR agents and call for better alignment techniques tailored to them. Code and datasets are available at https://chenxshuo.github.io/deeper-harm.