LiSA: Lifelong Safety Adaptation via Conservative Policy Induction

May 14, 2026
Authors: Minbeom Kim, Lesly Miculicich, Bhavana Dalvi Mishra, Mihir Parmar, Phillip Wallis, Bharath Chandrasekhar, Kyomin Jung, Tomas Pfister, Long T. Le
cs.AI

Abstract

As AI agents move from chat interfaces to systems that read private data, call tools, and execute multi-step workflows, guardrails become a last line of defense against concrete deployment harms. In these settings, guardrail failures are no longer merely answer-quality errors: they can leak secrets, authorize unsafe actions, or block legitimate work. The hardest failures are often contextual: whether an action is acceptable depends on local privacy norms, organizational policies, and user expectations that resist pre-deployment specification. This creates a practical gap: guardrails must adapt to their own operating environments, yet deployment feedback is typically limited to sparse, noisy user-reported failures, and repeated fine-tuning is often impractical. To address this gap, we propose LiSA (Lifelong Safety Adaptation), a conservative policy induction framework that improves a fixed base guardrail through structured memory. LiSA converts occasional failures into reusable policy abstractions so that sparse reports can generalize beyond individual cases, adds conflict-aware local rules to prevent overgeneralization in mixed-label contexts, and applies evidence-aware confidence gating via a posterior lower bound, so that memory reuse scales with accumulated evidence rather than empirical accuracy alone. Across PrivacyLens+, ConFaide+, and AgentHarm, LiSA consistently outperforms strong memory-based baselines under sparse feedback, remains robust under noisy user feedback even at 20% label-flip rates, and pushes the latency-performance frontier beyond backbone model scaling. Ultimately, LiSA offers a practical path to secure AI agents against the unpredictable long tail of real-world edge risks.
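
The evidence-aware confidence gating mentioned above can be illustrated with a small sketch. The abstract does not say which posterior LiSA uses, so this example assumes a Beta-Bernoulli model with a uniform Beta(1, 1) prior over how often a memory rule agrees with user feedback. The function names, the `threshold` of 0.8, and the `delta` of 0.05 are all hypothetical choices made for illustration, not values from the paper.

```python
from math import comb


def beta_cdf(x: float, a: int, b: int) -> float:
    """CDF of Beta(a, b) for positive integers a, b.

    Uses the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a),
    which avoids any dependency outside the standard library.
    """
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))


def posterior_lower_bound(successes: int, failures: int, delta: float = 0.05) -> float:
    """delta-quantile of the assumed Beta(1 + successes, 1 + failures) posterior."""
    a, b = 1 + successes, 1 + failures
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection on the monotone CDF
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < delta:
            lo = mid
        else:
            hi = mid
    return lo


def should_reuse(successes: int, failures: int,
                 threshold: float = 0.8, delta: float = 0.05) -> bool:
    """Gate (hypothetical): reuse a memory rule only if the lower bound clears the threshold."""
    return posterior_lower_bound(successes, failures, delta) >= threshold


# A rule that was right 2 out of 2 times has 100% empirical accuracy,
# but too little evidence to pass the gate.
print(should_reuse(2, 0))
# A rule that was right 90 out of 100 times has lower empirical accuracy,
# but enough evidence to pass.
print(should_reuse(90, 10))
```

Under these assumptions, the gate ranks a rule by how much evidence supports it and not only by its observed hit rate. This is one way to read "memory reuse scales with accumulated evidence rather than empirical accuracy alone."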