從提示注入到持續控制:防禦代理系統中的木馬後門
From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors
May 29, 2026
作者: Jiejun Tan, Zhicheng Dou, Xinyu Yang, Yuyang Hu, Yiruo Cheng, Xiaoxi Li, Ji-Rong Wen
cs.AI
摘要
LLM代理正從對話式聊天機器人演進為真實工作空間中的操作工具。在本地代理框架中,LLM可以讀寫檔案、呼叫工具,並跨工作階段重複使用工作空間狀態。雖然這些能力增強了實用性,但也為攻擊者暴露了新的攻擊面。攻擊者可以在檔案或工具輸出中嵌入提示注入。代理可能會讀取這個隱藏指令,將其儲存,並在後續執行。在這種多步驟木馬攻擊範式中,沒有任何單一步驟本身看似惡意,但這些步驟可以共同將不受信任的文字轉化為持久的控制內容。然而,現有的防禦措施通常孤立地檢查每個步驟。因此,它們可以阻擋明顯的有害行為,但無法檢測到植入後門的早期寫入操作。為了揭露這種威脅,我們引入了ClawTrojan,這是一個旨在識別本地代理框架中多步驟木馬攻擊的基準測試。在一個基於OpenClaw風格的模擬工作空間中,搭配GPT-5.4,ClawTrojan達到了95.5%的攻擊成功率,而現有的單輪提示注入攻擊在同一模型上的攻擊成功率接近零。為了解決這一威脅,我們提出了DASGuard,它掃描敏感本地檔案中的控制類文字,追蹤其來源,並移除非來自可信來源的控制內容。我們的結果顯示,DASGuard通過結合執行時攻擊阻擋與對工作空間的清理提交,實現了強大的動態防禦。
English
LLM agents are evolving from conversational chatbots to operational tools in real-world workspaces. In local agentic harnesses, an LLM can read and write files, call tools, and reuse workspace state across sessions. While such capabilities enhance utility, they also expose a new attack surface for attackers. Attackers can embed a prompt injection within a file or tool output. Agents may read this hidden instruction, store it, and execute it later. In this multi-step trojan attack paradigm, no individual step appears malicious on its own, but these steps can collectively turn untrusted text into persistent control content. However, existing defenses often inspect each step in isolation. As a result, they can block a clear harmful action, but fail to detect the earlier write operation that plants the backdoor. To reveal this threat, we introduce ClawTrojan, a benchmark designed to identify multi-step trojan attacks in local agentic harnesses. In an OpenClaw-style simulated workspace with GPT-5.4, ClawTrojan reaches a 95.5% attack success rate (ASR), while existing single-turn prompt-injection attacks produce near-zero ASR on the same model. To address this threat, we propose DASGuard, which scans control-like text in sensitive local files, traces its origin, and removes control content that does not originate from a trusted source. Our results show that DASGuard achieves strong dynamic defense by combining runtime attack blocking with sanitized commits to the workspace.