LiSA：透過保守策略歸納實現終身安全適應

摘要

隨著AI代理從對話介面轉向可讀取私人資料、呼叫工具並執行多步驟工作流程的系統，護欄成為了防止具體部署危害的最後一道防線。在這些情境中，護欄失效不再僅僅是回答品質錯誤：它們可能洩漏機密、授權不安全操作，或阻礙合法工作。最難處理的失效往往是情境相關的：某項行動是否可接受取決於當地隱私規範、組織政策以及使用者預期，而這些因素在部署前難以明確規範。這導致了一個實際落差：護欄必須適應自身的運作環境，然而部署回饋通常僅限於稀疏且帶雜訊的使用者回報失效，且重複微調往往不可行。為解決此落差，我們提出LiSA（終身安全適應），這是一個保守的政策歸納框架，透過結構化記憶來改進固定的基礎護欄。LiSA將偶發失效轉化為可重複使用的政策抽象，使稀疏回報能泛化至個別案例之外；加入衝突感知的局部規則以防止混合標籤情境中的過度泛化；並透過後驗下界應用證據感知的信心門控，使記憶重複使用能隨累積證據而非僅憑經驗準確率擴展。在PrivacyLens+、ConFaide+與AgentHarm上，LiSA在稀疏回饋下持續優於強大的基於記憶的基準方法，即使在20%標籤翻轉率的雜訊使用者回饋下仍保持穩健，並將延遲-效能前沿推至骨幹模型規模化之上。最終，LiSA提供了一條務實的路徑，以保護AI代理免受現實世界邊際風險中不可預測的長尾問題之害。

English

As AI agents move from chat interfaces to systems that read private data, call tools, and execute multi-step workflows, guardrails become a last line of defense against concrete deployment harms. In these settings, guardrail failures are no longer merely answer-quality errors: they can leak secrets, authorize unsafe actions, or block legitimate work. The hardest failures are often contextual: whether an action is acceptable depends on local privacy norms, organizational policies, and user expectations that resist pre-deployment specification. This creates a practical gap: guardrails must adapt to their own operating environments, yet deployment feedback is typically limited to sparse, noisy user-reported failures, and repeated fine-tuning is often impractical. To address this gap, we propose LiSA (Lifelong Safety Adaptation), a conservative policy induction framework that improves a fixed base guardrail through structured memory. LiSA converts occasional failures into reusable policy abstractions so that sparse reports can generalize beyond individual cases, adds conflict-aware local rules to prevent overgeneralization in mixed-label contexts, and applies evidence-aware confidence gating via a posterior lower bound, so that memory reuse scales with accumulated evidence rather than empirical accuracy alone. Across PrivacyLens+, ConFaide+, and AgentHarm, LiSA consistently outperforms strong memory-based baselines under sparse feedback, remains robust under noisy user feedback even at 20% label-flip rates, and pushes the latency--performance frontier beyond backbone model scaling. Ultimately, LiSA offers a practical path to secure AI agents against the unpredictable long tail of real-world edge risks.