SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety

Abstract

With the rapid evolution of foundation models, Large Language Model (LLM) agents have demonstrated increasingly powerful tool-use capabilities. This capability also introduces significant security risks, because malicious actors can manipulate agents into executing tools to generate harmful content. Existing defense mechanisms are effective, but they frequently suffer from over-refusal: stricter safety settings reduce the agent's utility on benign tasks. To mitigate this trade-off, we propose SafeHarbor, a framework designed to establish precise decision boundaries for LLM agents. Unlike static guidelines, SafeHarbor extracts context-aware defense rules through enhanced adversarial generation. We design a local hierarchical memory system that injects these rules dynamically, offering a training-free, efficient, and plug-and-play solution. Furthermore, we introduce an information entropy-based self-evolution mechanism that continuously optimizes the memory structure through dynamic node splitting and merging. Extensive experiments demonstrate that SafeHarbor achieves state-of-the-art performance on both ambiguous benign tasks and explicit malicious attacks. Notably, on GPT-4o it attains a peak benign utility of 63.6% while maintaining a refusal rate above 93% against harmful requests. The source code is publicly available at https://github.com/ljj-cyber/SafeHarbor.
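The abstract does not spell out how the entropy-based self-evolution decides when to split or merge memory nodes. The following is only a minimal sketch of one plausible reading, not the SafeHarbor implementation. It assumes each memory node stores the labels ("benign" or "harmful") of the cases its rules were derived from, that a node is split when its label entropy is high, and that two sibling nodes are merged when their combined entropy stays low. The function names, the label scheme, and the 0.8-bit threshold are all hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy, in bits, of the label distribution in one memory node."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    # Sum of p * log2(1/p) over the observed labels.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def should_split(labels, threshold=0.8):
    """Hypothetical rule: split a node whose labels are too mixed."""
    return shannon_entropy(labels) > threshold


def should_merge(labels_a, labels_b, threshold=0.8):
    """Hypothetical rule: merge two siblings if the combined node stays pure enough."""
    return shannon_entropy(list(labels_a) + list(labels_b)) <= threshold


if __name__ == "__main__":
    mixed = ["benign", "harmful", "benign", "harmful"]
    pure = ["harmful"] * 4
    print(should_split(mixed))                 # mixed node: candidate for splitting
    print(should_split(pure))                  # pure node: left as is
    print(should_merge(pure, ["harmful"] * 2)) # two pure siblings: candidate for merging
```

Under these assumptions, a node that mixes benign and harmful cases evenly has an entropy of 1 bit and would be split into more specific children, while siblings that agree on a single label have zero combined entropy and would be merged into one node.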