ChatPaper.aiChatPaper

柔性指令降级防御

Soft Instruction De-escalation Defense

October 24, 2025
作者: Nils Philipp Walter, Chawin Sitawarin, Jamie Hayes, David Stutz, Ilia Shumailov
cs.AI

摘要

大型语言模型(LLMs)正日益频繁地被部署于与外部环境交互的智能体系统中,这使得其在处理不可信数据时容易受到提示注入攻击。为突破此局限,我们提出SIC(软指令控制)——一种面向工具增强型LLM智能体的简易而高效的迭代式提示净化循环机制。该方法通过多重循环检测输入数据中可能破坏智能体行为的指令内容,若发现恶意内容则进行重写、屏蔽或删除操作,并对处理结果进行再次评估。该流程将持续至输入内容被完全净化或达到最大迭代次数;若仍有强制性指令类内容残留,智能体会终止运行以确保安全。通过允许多轮次处理,我们的方法认识到单次重写可能失败,但系统能在后续步骤中捕获并修正遗漏的注入攻击。尽管SIC具有即时实用性,但最坏情况分析表明其并非无懈可击——强大攻击者仍可通过嵌入非强制性工作流程实现15%的攻击成功率。尽管如此,该技术显著提升了安全防护门槛。
English
Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an external environment; this makes them susceptible to prompt injections when dealing with untrusted data. To overcome this limitation, we propose SIC (Soft Instruction Control)-a simple yet effective iterative prompt sanitization loop designed for tool-augmented LLM agents. Our method repeatedly inspects incoming data for instructions that could compromise agent behavior. If such content is found, the malicious content is rewritten, masked, or removed, and the result is re-evaluated. The process continues until the input is clean or a maximum iteration limit is reached; if imperative instruction-like content remains, the agent halts to ensure security. By allowing multiple passes, our approach acknowledges that individual rewrites may fail but enables the system to catch and correct missed injections in later steps. Although immediately useful, worst-case analysis shows that SIC is not infallible; strong adversary can still get a 15% ASR by embedding non-imperative workflows. This nonetheless raises the bar.
PDF41December 17, 2025