Deep Research Brings Deeper Harm

October 13, 2025
Authors: Shuo Chen, Zonggen Li, Zhen Han, Bailan He, Tong Liu, Haokun Chen, Georg Groh, Philip Torr, Volker Tresp, Jindong Gu
cs.AI

Abstract

Deep Research (DR) agents built on Large Language Models (LLMs) can perform complex, multi-step research by decomposing tasks, retrieving online information, and synthesizing detailed reports. However, misuse of LLMs with such powerful capabilities can lead to even greater risks. This is especially concerning in high-stakes, knowledge-intensive domains such as biosecurity, where a DR agent can generate a professional report containing detailed forbidden knowledge. Unfortunately, we have found such risks in practice: simply submitting a harmful query, which a standalone LLM directly rejects, can elicit a detailed and dangerous report from DR agents. This highlights the elevated risk and underscores the need for deeper safety analysis. Yet jailbreak methods designed for LLMs fall short of exposing such unique risks, as they do not target the research ability of DR agents. To address this gap, we propose two novel jailbreak strategies: Plan Injection, which injects malicious sub-goals into the agent's plan, and Intent Hijack, which reframes harmful queries as academic research questions. We conduct extensive experiments across different LLMs and various safety benchmarks, including general and biosecurity forbidden prompts. These experiments reveal three key findings: (1) the alignment of LLMs often fails in DR agents, where harmful prompts framed in academic terms can hijack agent intent; (2) multi-step planning and execution weaken alignment, revealing systemic vulnerabilities that prompt-level safeguards cannot address; (3) DR agents not only bypass refusals but also produce more coherent, professional, and dangerous content than standalone LLMs. These results demonstrate a fundamental misalignment in DR agents and call for better alignment techniques tailored to DR agents. Code and datasets are available at https://chenxshuo.github.io/deeper-harm.
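
To make the pipeline the abstract refers to concrete, below is a minimal sketch of a generic DR-agent loop (decompose a query into sub-goals, retrieve evidence per sub-goal, synthesize a report). This is not the authors' implementation: `call_llm`, `web_search`, `ResearchPlan`, and all prompts are hypothetical placeholders. The sketch only illustrates the two surfaces the paper's attacks target, the plan data structure (Plan Injection) and the framing of the user query before decomposition (Intent Hijack); no attack logic is shown.

```python
# Hypothetical sketch of a generic Deep Research (DR) agent loop, assuming a
# stubbed LLM backend and search tool. Not the paper's code.

from dataclasses import dataclass, field


def call_llm(prompt: str) -> str:
    """Placeholder for an LLM backend (e.g., an API call)."""
    return f"[LLM output for: {prompt[:60]}...]"


def web_search(sub_goal: str) -> str:
    """Placeholder for the agent's online retrieval tool."""
    return f"[search results for: {sub_goal}]"


@dataclass
class ResearchPlan:
    query: str
    # The list of sub-goals is the surface a Plan Injection attack would tamper with.
    sub_goals: list[str] = field(default_factory=list)


def decompose(query: str) -> ResearchPlan:
    # The query framing seen here is the surface Intent Hijack targets
    # (e.g., by rewrapping a request in academic language upstream).
    raw = call_llm(f"Break this research question into sub-goals: {query}")
    return ResearchPlan(query=query, sub_goals=[raw])


def execute(plan: ResearchPlan) -> list[str]:
    # Each sub-goal is retrieved and summarized with only local context,
    # which is why prompt-level safeguards see fragments rather than intent.
    notes = []
    for goal in plan.sub_goals:
        evidence = web_search(goal)
        notes.append(call_llm(f"Summarize evidence for '{goal}': {evidence}"))
    return notes


def synthesize(plan: ResearchPlan, notes: list[str]) -> str:
    return call_llm(f"Write a detailed report on '{plan.query}' using notes: {notes}")


if __name__ == "__main__":
    plan = decompose("How do coral reefs respond to ocean warming?")
    print(synthesize(plan, execute(plan)))
```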