LiSA: Lifelong Safety Adaptation via Conservative Policy Induction

Abstract

As AI agents move from chat interfaces to systems that read private data, call tools, and execute multi-step workflows, guardrails become a last line of defense against concrete deployment harms. In these settings, guardrail failures are no longer merely answer-quality errors: they can leak secrets, authorize unsafe actions, or block legitimate work. The hardest failures are often contextual: whether an action is acceptable depends on local privacy norms, organizational policies, and user expectations that resist pre-deployment specification. This creates a practical gap: guardrails must adapt to their own operating environments, yet deployment feedback is typically limited to sparse, noisy user-reported failures, and repeated fine-tuning is often impractical. To address this gap, we propose LiSA (Lifelong Safety Adaptation), a conservative policy induction framework that improves a fixed base guardrail through structured memory. LiSA converts occasional failures into reusable policy abstractions so that sparse reports can generalize beyond individual cases, adds conflict-aware local rules to prevent overgeneralization in mixed-label contexts, and applies evidence-aware confidence gating via a posterior lower bound, so that memory reuse scales with accumulated evidence rather than empirical accuracy alone. Across PrivacyLens+, ConFaide+, and AgentHarm, LiSA consistently outperforms strong memory-based baselines under sparse feedback, remains robust under noisy user feedback even at 20% label-flip rates, and pushes the latency-performance frontier beyond backbone model scaling. Ultimately, LiSA offers a practical path to secure AI agents against the unpredictable long tail of real-world edge risks.
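
To make the idea of evidence-aware gating concrete, the following is a minimal sketch and not the authors' implementation. It assumes each stored rule keeps hypothetical counts of correct and wrong reuses, places a Beta(1, 1) prior on the rule's accuracy, and approximates a one-sided posterior lower bound with a normal approximation (mean minus z standard deviations). The names `RuleStats`, `posterior_lower_bound`, and `should_reuse`, the prior, the z value, and the 0.8 threshold are all illustrative assumptions; the abstract does not specify them.

```python
import math
from dataclasses import dataclass


@dataclass
class RuleStats:
    """Hypothetical per-rule feedback counters (an assumption, not from the paper)."""
    correct: int = 0
    wrong: int = 0


def posterior_lower_bound(stats: RuleStats,
                          prior_a: float = 1.0,
                          prior_b: float = 1.0,
                          z: float = 1.645) -> float:
    """Approximate lower bound on a rule's accuracy under a Beta posterior.

    Uses the Beta posterior mean and variance with a normal approximation,
    which is a stand-in for an exact Beta quantile.
    """
    a = prior_a + stats.correct
    b = prior_b + stats.wrong
    n = a + b
    mean = a / n
    var = a * b / (n * n * (n + 1.0))
    return max(0.0, mean - z * math.sqrt(var))


def should_reuse(stats: RuleStats, threshold: float = 0.8) -> bool:
    """Gate memory reuse on the lower bound, not on raw empirical accuracy."""
    return posterior_lower_bound(stats) >= threshold


# Both rules have 100% empirical accuracy, but very different evidence.
few = RuleStats(correct=3, wrong=0)
many = RuleStats(correct=60, wrong=0)

print(round(posterior_lower_bound(few), 3), should_reuse(few))
print(round(posterior_lower_bound(many), 3), should_reuse(many))
```

In this sketch the rule backed by three reports is held back while the rule backed by sixty is reused, even though neither has ever been observed to be wrong. That is the sense in which reuse scales with accumulated evidence rather than empirical accuracy alone.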