CausalArmor: Efficient Indirect Prompt Injection Guardrails via Causal Attribution
February 8, 2026
Authors: Minbeom Kim, Mihir Parmar, Phillip Wallis, Lesly Miculicich, Kyomin Jung, Krishnamurthy Dj Dvijotham, Long T. Le, Tomas Pfister
cs.AI
Abstract
AI agents equipped with tool-calling capabilities are susceptible to Indirect Prompt Injection (IPI) attacks, in which malicious commands hidden within untrusted content trick the agent into performing unauthorized actions. Existing defenses can reduce attack success but often suffer from the over-defense dilemma: they deploy expensive, always-on sanitization regardless of actual threat, thereby degrading utility and latency even in benign scenarios. We revisit IPI through a causal ablation perspective: a successful injection manifests as a dominance shift where the user request no longer provides decisive support for the agent's privileged action, while a particular untrusted segment, such as a retrieved document or tool output, provides disproportionate attributable influence. Based on this signature, we propose CausalArmor, a selective defense framework that (i) computes lightweight, leave-one-out ablation-based attributions at privileged decision points, and (ii) triggers targeted sanitization only when an untrusted segment dominates the user intent. Additionally, CausalArmor employs retroactive Chain-of-Thought masking to prevent the agent from acting on "poisoned" reasoning traces. We present a theoretical analysis showing that sanitization based on attribution margins conditionally yields an exponentially small upper bound on the probability of selecting malicious actions. Experiments on AgentDojo and DoomArena demonstrate that CausalArmor matches the security of aggressive defenses while improving explainability and preserving the utility and latency of AI agents.
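To make the attribution-gated mechanism described above concrete, the sketch below illustrates the general idea of leave-one-out attribution at a privileged decision point followed by selective sanitization. It is a minimal reading of the abstract only, not the paper's actual method: the scoring function `action_logprob`, the `sanitize` routine, and the dominance `margin` are hypothetical placeholders, and the retroactive Chain-of-Thought masking component is not shown.

```python
# Illustrative sketch (not the paper's implementation) of leave-one-out,
# attribution-gated sanitization for a proposed privileged action.
from typing import Callable, Sequence

def loo_attributions(
    action_logprob: Callable[[Sequence[str]], float],  # hypothetical: log P(action | context)
    user_request: str,
    untrusted_segments: Sequence[str],
) -> tuple[float, list[float]]:
    """Score drop when each context segment is ablated (leave-one-out)."""
    base = action_logprob([user_request, *untrusted_segments])

    # Attribution of the user request: how much the action score falls without it.
    user_attr = base - action_logprob(list(untrusted_segments))

    # Attribution of each untrusted segment (retrieved document, tool output, ...).
    seg_attrs = []
    for i in range(len(untrusted_segments)):
        ablated = [user_request, *untrusted_segments[:i], *untrusted_segments[i + 1:]]
        seg_attrs.append(base - action_logprob(ablated))
    return user_attr, seg_attrs

def guard_privileged_action(
    action_logprob: Callable[[Sequence[str]], float],
    user_request: str,
    untrusted_segments: Sequence[str],
    sanitize: Callable[[str], str],  # hypothetical targeted sanitizer
    margin: float = 0.0,             # hypothetical dominance threshold
) -> list[str]:
    """Sanitize only the segments whose attribution dominates the user request's."""
    user_attr, seg_attrs = loo_attributions(action_logprob, user_request, untrusted_segments)
    dominant = {i for i, a in enumerate(seg_attrs) if a - user_attr > margin}
    if not dominant:
        return list(untrusted_segments)  # benign path: no extra sanitization cost
    return [sanitize(s) if i in dominant else s for i, s in enumerate(untrusted_segments)]
```

Under these assumptions, the gate only pays the sanitization cost when some untrusted segment out-attributes the user request, which is the selectivity the abstract contrasts with always-on defenses.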