CausalArmor: Efficiënte Beveiligingsmaatregelen tegen Indirecte Prompt Injecties via Causale Attributie

Samenvatting

AI-agents met tool-aanroepmogelijkheden zijn vatbaar voor Indirecte Prompt Injectie (IPI) aanvallen. In dit aanvalsscenario misleiden kwaadaardige commando's, verborgen in niet-vertrouwde content, de agent om onbevoegde acties uit te voeren. Bestaande verdedigingen kunnen het aanvalssucces verminderen, maar lijden vaak onder het oververdedigingsdilemma: ze zetten kostbare, altijd-actieve sanitisatie in, ongeacht de werkelijke dreiging, wat het nut en de latentie aantast, zelfs in goedaardige scenario's. Wij herbezien IPI vanuit een causaal ablatieperspectief: een succesvolle injectie manifesteert zich als een dominantieverschuiving waarbij het gebruikersverzoek niet langer doorslaggevende steun biedt voor de geprivilegieerde actie van de agent, terwijl een specifiek niet-vertrouwd segment, zoals een opgehaald document of tool-output, een disproportioneel toerekenbare invloed uitoefent. Gebaseerd op deze signatuur stellen wij CausalArmor voor, een selectief verdedigingskader dat (i) lichtgewicht, op leave-one-out-ablatie gebaseerde attributies berekent op geprivilegieerde beslispunten, en (ii) gerichte sanitisatie activeert alleen wanneer een niet-vertrouwd segment de gebruikersintentie domineert. Daarnaast gebruikt CausalArmor retroactieve Chain-of-Thought-masking om te voorkomen dat de agent handelt op basis van 'vergiftigde' redeneersporen. Wij presenteren een theoretische analyse die aantoont dat sanitisatie gebaseerd op attributiemarges onder voorwaarden een exponentieel kleine bovengrens oplevert voor de waarschijnlijkheid van het selecteren van kwaadaardige acties. Experimenten op AgentDojo en DoomArena tonen aan dat CausalArmor de beveiliging van agressieve verdedigingen evenaart, terwijl het de verklaarbaarheid verbetert en het nut en de latentie van AI-agents behoudt.

English

AI agents equipped with tool-calling capabilities are susceptible to Indirect Prompt Injection (IPI) attacks. In this attack scenario, malicious commands hidden within untrusted content trick the agent into performing unauthorized actions. Existing defenses can reduce attack success but often suffer from the over-defense dilemma: they deploy expensive, always-on sanitization regardless of actual threat, thereby degrading utility and latency even in benign scenarios. We revisit IPI through a causal ablation perspective: a successful injection manifests as a dominance shift where the user request no longer provides decisive support for the agent's privileged action, while a particular untrusted segment, such as a retrieved document or tool output, provides disproportionate attributable influence. Based on this signature, we propose CausalArmor, a selective defense framework that (i) computes lightweight, leave-one-out ablation-based attributions at privileged decision points, and (ii) triggers targeted sanitization only when an untrusted segment dominates the user intent. Additionally, CausalArmor employs retroactive Chain-of-Thought masking to prevent the agent from acting on ``poisoned'' reasoning traces. We present a theoretical analysis showing that sanitization based on attribution margins conditionally yields an exponentially small upper bound on the probability of selecting malicious actions. Experiments on AgentDojo and DoomArena demonstrate that CausalArmor matches the security of aggressive defenses while improving explainability and preserving utility and latency of AI agents.

CausalArmor: Efficiënte Beveiligingsmaatregelen tegen Indirecte Prompt Injecties via Causale Attributie

CausalArmor: Efficient Indirect Prompt Injection Guardrails via Causal Attribution

Samenvatting

Support