ソフト命令エスカレーション防止防御

要旨

大規模言語モデル（LLMs）は、外部環境と相互作用するエージェントシステムにおいてますます利用されるようになっており、これにより信頼できないデータを扱う際にプロンプトインジェクションの影響を受けやすくなっている。この制限を克服するため、我々はツール拡張型LLMエージェント向けに設計された、シンプルかつ効果的な反復的プロンプトサニタイゼーションループであるSIC（Soft Instruction Control）を提案する。本手法では、入力データを繰り返し検査し、エージェントの動作を危険にさらす可能性のある命令が含まれていないかを確認する。悪意のある内容が検出された場合、その内容は書き換え、マスク、または削除され、結果が再評価される。このプロセスは、入力が安全な状態になるか、最大反復回数に達するまで継続される。必須の命令的な内容が残存する場合、エージェントはセキュリティを確保するために動作を停止する。複数回のパスを許可することにより、個々の書き換え処理が失敗する可能性を認めつつ、システムが後続のステップで見逃されたインジェクションを検出し修正することを可能にする。SICは即時の有用性を持つが、最悪ケース分析によれば本手法も絶対確実ではなく、強力な攻撃者は非必須的なワークフローを埋め込むことで15%の攻撃成功率（ASR）を達成し得る。しかしながら、これはセキュリティのハードルを確実に高めるものである。

English

Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an external environment; this makes them susceptible to prompt injections when dealing with untrusted data. To overcome this limitation, we propose SIC (Soft Instruction Control)-a simple yet effective iterative prompt sanitization loop designed for tool-augmented LLM agents. Our method repeatedly inspects incoming data for instructions that could compromise agent behavior. If such content is found, the malicious content is rewritten, masked, or removed, and the result is re-evaluated. The process continues until the input is clean or a maximum iteration limit is reached; if imperative instruction-like content remains, the agent halts to ensure security. By allowing multiple passes, our approach acknowledges that individual rewrites may fail but enables the system to catch and correct missed injections in later steps. Although immediately useful, worst-case analysis shows that SIC is not infallible; strong adversary can still get a 15% ASR by embedding non-imperative workflows. This nonetheless raises the bar.

ソフト命令エスカレーション防止防御

Soft Instruction De-escalation Defense

要旨

Support