从提示注入到持久控制：防御智能体控制系统的特洛伊木马后门

摘要

LLM智能体正从对话式聊天机器人演化为实际工作空间中的操作工具。在本地智能体框架中，LLM能够读写文件、调用工具，并在会话间复用工作空间状态。这类能力虽提升了实用性，却也为攻击者开辟了新的攻击面。攻击者可将提示注入嵌入文件或工具输出中，智能体可能读取这些隐藏指令并加以存储，待后续执行。在这种多步骤木马攻击范式中，单个步骤看似无害，但组合起来便能使不可信文本转化为持续性操控内容。现有防御机制往往孤立检测每个步骤，虽能拦截显式有害行为，却无法检测埋设后门的早期写入操作。为揭示此威胁，我们提出ClawTrojan基准测试——专为识别本地智能体框架中的多步骤木马攻击而设计。在基于GPT-5.4的OpenClaw仿真工作空间中，ClawTrojan的攻击成功率(ASR)达95.5%，而现有单轮提示注入攻击在相同模型上的ASR近乎为零。针对该威胁，我们提出DASGuard防御方案，通过扫描敏感本地文件中的控制类文本、追溯其来源，并移除非可信源头的控制内容。实验表明，DASGuard通过运行时攻击拦截与工作空间净化提交的结合，实现了强有力的动态防御。

English

LLM agents are evolving from conversational chatbots to operational tools in real-world workspaces. In local agentic harnesses, an LLM can read and write files, call tools, and reuse workspace state across sessions. While such capabilities enhance utility, they also expose a new attack surface for attackers. Attackers can embed a prompt injection within a file or tool output. Agents may read this hidden instruction, store it, and execute it later. In this multi-step trojan attack paradigm, no individual step appears malicious on its own, but these steps can collectively turn untrusted text into persistent control content. However, existing defenses often inspect each step in isolation. As a result, they can block a clear harmful action, but fail to detect the earlier write operation that plants the backdoor. To reveal this threat, we introduce ClawTrojan, a benchmark designed to identify multi-step trojan attacks in local agentic harnesses. In an OpenClaw-style simulated workspace with GPT-5.4, ClawTrojan reaches a 95.5% attack success rate (ASR), while existing single-turn prompt-injection attacks produce near-zero ASR on the same model. To address this threat, we propose DASGuard, which scans control-like text in sensitive local files, traces its origin, and removes control content that does not originate from a trusted source. Our results show that DASGuard achieves strong dynamic defense by combining runtime attack blocking with sanitized commits to the workspace.