Session Risk Memory (SRM): Temporal Authorization for Deterministic Pre-Execution Safety Gates

Abstract

Deterministic pre-execution safety gates evaluate whether each individual agent action is compatible with the agent's assigned role. While effective at per-action authorization, these systems are structurally blind to distributed attacks that decompose harmful intent across multiple individually compliant steps. This paper introduces Session Risk Memory (SRM), a lightweight deterministic module that extends stateless execution gates with trajectory-level authorization. SRM maintains a compact semantic centroid representing the evolving behavioral profile of an agent session and accumulates a risk signal through an exponential moving average over baseline-subtracted gate outputs. It operates on the same semantic vector representation as the underlying gate, so it requires no additional model components, training, or probabilistic inference. We evaluate SRM on a multi-turn benchmark of 80 sessions containing slow-burn exfiltration, gradual privilege escalation, and compliance drift scenarios. ILION+SRM achieves F1 = 1.0000 with a 0% false positive rate (FPR), compared with F1 = 0.9756 and a 5% FPR for the stateless ILION gate, and both systems maintain a 100% detection rate. Critically, SRM eliminates all false positives on this benchmark with a per-turn overhead under 250 microseconds. The framework introduces a conceptual distinction between spatial authorization consistency (evaluated per action) and temporal authorization consistency (evaluated over a trajectory), providing a principled basis for session-level safety in agentic systems.
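The accumulation step described above can be illustrated with a short sketch. The abstract does not give SRM's update equations, so everything below is an assumption made for illustration: the class name, the smoothing factor `alpha`, the `baseline` and `threshold` values, and the use of a running mean for the centroid are all hypothetical. The abstract also does not say how the centroid enters the decision, so the sketch only maintains it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SessionRiskMemory:
    """Illustrative sketch only; parameter names and values are assumed, not from the paper."""
    alpha: float = 0.3       # EMA smoothing factor (assumed)
    baseline: float = 0.2    # gate output expected for benign actions (assumed)
    threshold: float = 0.25  # session-level flagging threshold (assumed)
    risk: float = 0.0        # accumulated risk signal
    turns: int = 0
    centroid: Optional[List[float]] = None  # running mean of action embeddings

    def update(self, embedding: List[float], gate_score: float) -> bool:
        """Fold one turn into the session state; return True if the session is flagged."""
        self.turns += 1
        if self.centroid is None:
            self.centroid = list(embedding)
        else:
            # Incremental mean: c_n = c_{n-1} + (e - c_{n-1}) / n
            self.centroid = [
                c + (e - c) / self.turns for c, e in zip(self.centroid, embedding)
            ]
        # EMA over the baseline-subtracted gate output.
        excess = gate_score - self.baseline
        self.risk = (1.0 - self.alpha) * self.risk + self.alpha * excess
        return self.risk > self.threshold


def run_session(gate_scores: List[float]) -> List[bool]:
    """Replay a list of per-turn gate scores and return the per-turn flags."""
    srm = SessionRiskMemory()
    # A fixed placeholder embedding is used here; a real gate would supply one per action.
    return [srm.update([1.0, 0.0], score) for score in gate_scores]
```

Under these assumed parameters, a single elevated gate score surrounded by benign turns decays before crossing the threshold, while a sustained run of moderately elevated scores accumulates until the session is flagged. This is one way a stateful layer could both suppress isolated false positives and catch slow-burn trajectories. The update uses only arithmetic on values the gate already produces, which fits the abstract's claim that no extra model or probabilistic inference is needed.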
