소프트 명령 비확산 방어

초록

대규모 언어 모델(LLM)은 외부 환경과 상호작용하는 에이전트 시스템에 점점 더 많이 배포되며, 이로 인해 신뢰할 수 없는 데이터를 처리할 때 프롬프트 인젝션에 취약해질 수 있습니다. 이러한 한계를 극복하기 위해 우리는 도구를 활용하는 LLM 에이전트를 위해 설계된 간단하면서 효과적인 반복적 프롬프트 살균 루프인 SIC(Soft Instruction Control)를 제안합니다. 우리의 방법은 유입되는 데이터를 반복적으로 검사하여 에이전트 동작을 손상시킬 수 있는 명령어가 있는지 확인합니다. 이러한 내용이 발견되면 악성 콘텐츠를 재작성, 마스킹 또는 제거한 후 결과를 재평가합니다. 이 프로세스는 입력이 안전해지거나 최대 반복 한도에 도달할 때까지 계속되며, 만약 명령형 명령어 형태의 내용이 남아 있을 경우 보안을 위해 에이전트가 중단됩니다. 다중 패스를 허용함으로써, 우리의 접근 방식은 개별 재작성이 실패할 수 있음을 인정하지만 시스템이 후속 단계에서 놓친 인젝션을 포착하고 수정할 수 있도록 합니다. SIC는 즉각적으로 유용하지만, 최악의 경우 분석에 따르면 SIC도 완벽하지는 않습니다. 강력한 공격자는 비명령형 워크플로를 내장함으로써 여전히 15%의 공격 성공률(ASR)을 달성할 수 있습니다. 그럼에도 불구하고 이는 보안 장벽을 높이는 것입니다.

English

Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an external environment; this makes them susceptible to prompt injections when dealing with untrusted data. To overcome this limitation, we propose SIC (Soft Instruction Control)-a simple yet effective iterative prompt sanitization loop designed for tool-augmented LLM agents. Our method repeatedly inspects incoming data for instructions that could compromise agent behavior. If such content is found, the malicious content is rewritten, masked, or removed, and the result is re-evaluated. The process continues until the input is clean or a maximum iteration limit is reached; if imperative instruction-like content remains, the agent halts to ensure security. By allowing multiple passes, our approach acknowledges that individual rewrites may fail but enables the system to catch and correct missed injections in later steps. Although immediately useful, worst-case analysis shows that SIC is not infallible; strong adversary can still get a 15% ASR by embedding non-imperative workflows. This nonetheless raises the bar.

소프트 명령 비확산 방어

Soft Instruction De-escalation Defense

초록

Support