软指令降级防御

摘要

大型语言模型（LLMs）在面向外部环境的智能体系统中日益普及，这使其在处理不可信数据时易受提示注入攻击。为突破此局限，我们提出SIC（软指令控制）——一种面向工具增强型LLM智能体的简洁高效迭代式提示净化循环机制。该方法通过多重循环检测输入数据中可能干扰智能体行为的指令内容，若发现恶意内容则进行重写、屏蔽或删除处理，并对结果进行再次评估。该流程持续至输入数据净化完成或达到最大迭代次数；若仍存在强制性指令类内容，系统将中止运行以确保安全。通过允许多轮处理，本方法承认单次改写可能失败，但能在后续步骤中捕获并修正遗漏的注入攻击。尽管具备即时实用性，最坏情况分析表明SIC并非无懈可击：强大攻击者仍可通过嵌入非强制性工作流程实现15%的攻击成功率。但这一方案显著提升了防御门槛。

English

Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an external environment; this makes them susceptible to prompt injections when dealing with untrusted data. To overcome this limitation, we propose SIC (Soft Instruction Control)-a simple yet effective iterative prompt sanitization loop designed for tool-augmented LLM agents. Our method repeatedly inspects incoming data for instructions that could compromise agent behavior. If such content is found, the malicious content is rewritten, masked, or removed, and the result is re-evaluated. The process continues until the input is clean or a maximum iteration limit is reached; if imperative instruction-like content remains, the agent halts to ensure security. By allowing multiple passes, our approach acknowledges that individual rewrites may fail but enables the system to catch and correct missed injections in later steps. Although immediately useful, worst-case analysis shows that SIC is not infallible; strong adversary can still get a 15% ASR by embedding non-imperative workflows. This nonetheless raises the bar.

软指令降级防御

Soft Instruction De-escalation Defense

摘要

Support