LiSA: Lifelong Safety Adaptation via Conservative Policy Induction

Abstract

As AI agents move from chat interfaces to systems that read private data, call tools, and execute multi-step workflows, guardrails become a last line of defense against concrete deployment harms. In these settings, guardrail failures are no longer merely answer-quality errors: they can leak secrets, authorize unsafe actions, or block legitimate work. The hardest failures are often contextual: whether an action is acceptable depends on local privacy norms, organizational policies, and user expectations that resist pre-deployment specification. This creates a practical gap: guardrails must adapt to their own operating environments, yet deployment feedback is typically limited to sparse, noisy user-reported failures, and repeated fine-tuning is often impractical. To address this gap, we propose LiSA (Lifelong Safety Adaptation), a conservative policy induction framework that improves a fixed base guardrail through structured memory. LiSA converts occasional failures into reusable policy abstractions so that sparse reports can generalize beyond individual cases, adds conflict-aware local rules to prevent overgeneralization in mixed-label contexts, and applies evidence-aware confidence gating via a posterior lower bound, so that memory reuse scales with accumulated evidence rather than empirical accuracy alone. Across PrivacyLens+, ConFaide+, and AgentHarm, LiSA consistently outperforms strong memory-based baselines under sparse feedback, remains robust under noisy user feedback even at 20% label-flip rates, and pushes the latency-performance frontier beyond backbone model scaling. Ultimately, LiSA offers a practical path to secure AI agents against the unpredictable long tail of real-world edge risks.
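To make the idea of evidence-aware confidence gating concrete, the following is a minimal sketch, not the paper's implementation. It assumes each memory rule keeps a count of confirmed and contradicted uses, places a uniform Beta(1, 1) prior on the rule's accuracy, and reuses the rule only when a lower quantile of the resulting Beta posterior clears a threshold. The function names, the prior, `delta`, and `threshold` are all hypothetical choices made for illustration; the abstract does not specify the exact bound.

```python
from math import comb


def beta_cdf(x: float, a: int, b: int) -> float:
    """CDF of Beta(a, b) at x for positive integer a, b.

    Uses the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a),
    which holds for integer shape parameters.
    """
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))


def posterior_lower_bound(successes: int, failures: int, delta: float = 0.05) -> float:
    """delta-quantile of the Beta(1 + successes, 1 + failures) posterior.

    Assumption: uniform Beta(1, 1) prior over the rule's accuracy.
    The quantile is located by bisection on the monotone CDF.
    """
    a, b = 1 + successes, 1 + failures
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < delta:
            lo = mid
        else:
            hi = mid
    return lo


def should_reuse(successes: int, failures: int,
                 threshold: float = 0.8, delta: float = 0.05) -> bool:
    """Gate: reuse a memory rule only if its lower bound clears the threshold."""
    return posterior_lower_bound(successes, failures, delta) >= threshold


if __name__ == "__main__":
    # A rule seen twice (both correct) versus one seen 42 times (40 correct).
    for s, f in [(2, 0), (40, 2)]:
        print(s, f, round(posterior_lower_bound(s, f), 3), should_reuse(s, f))
```

Under these assumptions, a rule with 2 of 2 correct uses has 100% empirical accuracy but a lower bound of only about 0.37, so it is not reused, while a rule with 40 of 42 correct uses has a lower bound of about 0.86 and passes the gate. This is one way reuse can scale with accumulated evidence rather than with empirical accuracy alone.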