プロンプトインジェクションから永続的制御へ：エージェント型ハーネスをトロイの木馬バックドアから防御する

要旨

LLMエージェントは、会話型チャットボットから実世界のワークスペースにおける運用ツールへと進化している。ローカルエージェンティックハーネスにおいて、LLMはファイルの読み書き、ツールの呼び出し、セッションをまたいだワークスペース状態の再利用が可能である。こうした機能は実用性を高める一方で、攻撃者にとって新たな攻撃対象領域を露呈する。攻撃者はファイルやツールの出力内にプロンプトインジェクションを埋め込むことができる。エージェントはこの隠された命令を読み取り、保存し、後で実行する可能性がある。このマルチステップトロイ攻撃パラダイムでは、個々のステップ自体は悪意があるようには見えないが、これらのステップは総じて信頼できないテキストを永続的な制御コンテンツに変え得る。しかし、既存の防御策は各ステップを個別に検査することが多い。その結果、明らかな有害行為をブロックできても、バックドアを仕込む初期の書き込み操作を検出できない。この脅威を明らかにするため、我々はローカルエージェンティックハーネスにおけるマルチステップトロイ攻撃を特定するベンチマーク、ClawTrojanを導入する。GPT-5.4を用いたOpenClow型シミュレーションワークスペースにおいて、ClawTrojanは95.5%の攻撃成功率（ASR）を達成する一方、既存の単一ターンプロンプトインジェクション攻撃は同一モデルでASRがほぼゼロとなる。この脅威に対処するため、我々はDASGuardを提案する。これは機密性の高いローカルファイル内の制御的なテキストをスキャンし、その出所を追跡し、信頼できるソースに由来しない制御コンテンツを除去する。我々の結果は、DASGuardが実行時の攻撃ブロックとワークスペースへのサニタイズ済みコミットを組み合わせることで、強力な動的防御を実現することを示している。

English

LLM agents are evolving from conversational chatbots to operational tools in real-world workspaces. In local agentic harnesses, an LLM can read and write files, call tools, and reuse workspace state across sessions. While such capabilities enhance utility, they also expose a new attack surface for attackers. Attackers can embed a prompt injection within a file or tool output. Agents may read this hidden instruction, store it, and execute it later. In this multi-step trojan attack paradigm, no individual step appears malicious on its own, but these steps can collectively turn untrusted text into persistent control content. However, existing defenses often inspect each step in isolation. As a result, they can block a clear harmful action, but fail to detect the earlier write operation that plants the backdoor. To reveal this threat, we introduce ClawTrojan, a benchmark designed to identify multi-step trojan attacks in local agentic harnesses. In an OpenClaw-style simulated workspace with GPT-5.4, ClawTrojan reaches a 95.5% attack success rate (ASR), while existing single-turn prompt-injection attacks produce near-zero ASR on the same model. To address this threat, we propose DASGuard, which scans control-like text in sensitive local files, traces its origin, and removes control content that does not originate from a trusted source. Our results show that DASGuard achieves strong dynamic defense by combining runtime attack blocking with sanitized commits to the workspace.