GateMem: A Memory Governance Benchmark for Multi-Principal Shared-Memory Agents

Abstract

Memory benchmarks for LLM agents largely assume single-user settings, leaving shared assistants for hospitals, workplaces, campuses, and households understudied. In these deployments, multiple principals write to a common memory pool and query it under different roles, scopes, and relationships, so memory quality requires governance as well as recall. We introduce GateMem, a benchmark for multi-principal shared-memory agents. GateMem jointly evaluates three capabilities: utility on legitimate long-horizon requests that involve state updates, access control across contextual authorization boundaries, and agent-facing active forgetting after explicit deletion requests. It spans medical, office, education, and household domains, with long-form multi-party episodes, incremental memory injection, hidden checkpoints, structured judging, and leak-target annotations. Across diverse baselines and backbone models, no method simultaneously achieves strong utility, robust access control, and reliable forgetting. Long-context prompting often yields the best governance score, but at high token cost, while retrieval-based and external-memory methods reduce cost yet still leak unauthorized or deleted information. These results show that current memory agents remain far from ready for reliable deployment in shared institutional settings.
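
To make the three evaluated capabilities concrete, the following is a minimal sketch of a governed shared memory pool. It is not GateMem's implementation or API: the names `MemoryItem`, `SharedMemory`, and `leaked`, the per-item reader set, and the keyword matching are all hypothetical simplifications chosen for illustration. An authorized read corresponds to utility, a filtered read to access control, and a read after `forget` to active forgetting.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryItem:
    """One entry in the shared pool (hypothetical schema, for illustration only)."""
    owner: str                                      # principal who wrote the item
    text: str                                       # stored content
    readers: set[str] = field(default_factory=set)  # principals allowed to read it
    deleted: bool = False                           # set after an explicit deletion request


class SharedMemory:
    """Toy shared pool: many principals write, reads are filtered per requester."""

    def __init__(self) -> None:
        self.items: list[MemoryItem] = []

    def write(self, owner: str, text: str, readers: set[str]) -> None:
        # The owner can always read their own item.
        self.items.append(MemoryItem(owner, text, set(readers) | {owner}))

    def forget(self, owner: str, keyword: str) -> int:
        # Explicit deletion request: only the owner's own matching items are marked deleted.
        hits = [
            m for m in self.items
            if m.owner == owner and keyword in m.text and not m.deleted
        ]
        for m in hits:
            m.deleted = True
        return len(hits)

    def query(self, requester: str, keyword: str) -> list[str]:
        # Return only items that are live, readable by the requester, and relevant.
        return [
            m.text for m in self.items
            if not m.deleted and requester in m.readers and keyword in m.text
        ]


def leaked(answer: list[str], leak_targets: list[str]) -> bool:
    """Crude stand-in for a leak check: does any answer contain a protected string?"""
    return any(target in part for part in answer for target in leak_targets)


if __name__ == "__main__":
    pool = SharedMemory()
    pool.write("patient_a", "patient_a allergy: penicillin", readers={"dr_lee"})

    print(pool.query("dr_lee", "allergy"))     # authorized read (utility)
    print(pool.query("visitor_b", "allergy"))  # unauthorized read (access control)

    pool.forget("patient_a", "allergy")        # explicit deletion request
    print(pool.query("dr_lee", "allergy"))     # read after deletion (forgetting)
```

A real memory agent mediates these operations through a language model, retrieval, or summaries rather than exact filters, which is where unauthorized or deleted content can resurface. The sketch only fixes the intended behavior against which such leaks would be counted.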