GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

Abstract

Memory benchmarks for LLM agents largely assume single-user settings, leaving shared assistants for hospitals, workplaces, campuses, and households understudied. In these deployments, multiple principals write to a common memory pool and query it under different roles, scopes, and relationships, so memory quality requires governance as well as recall. We introduce GateMem, a benchmark for multi-principal shared-memory agents. GateMem jointly evaluates utility for legitimate long-horizon requests with state updates, access control across contextual authorization boundaries, and agent-facing active forgetting after explicit deletion requests. It spans medical, office, education, and household domains, with long-form multi-party episodes, incremental memory injection, hidden checkpoints, structured judging, and leak-target annotations. Across diverse baselines and backbone models, no method simultaneously achieves strong utility, robust access control, and reliable forgetting. Long-context prompting often yields the best governance score at high token cost, while retrieval-based and external-memory methods reduce cost yet still leak unauthorized or deleted information. These results show current memory agents remain far from reliable shared institutional deployment.
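
To make the three evaluated axes concrete, the following is a minimal illustrative sketch, not GateMem's actual data model or judging pipeline. It assumes a toy shared memory pool in which each entry carries an explicit reader set, the author may delete their own entries, and a response counts as a leak if it contains any annotated leak-target string. All names (`MemoryEntry`, `SharedMemory`, `leaks`, and the example principals) are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    """One item in the shared pool (hypothetical schema)."""
    entry_id: int
    author: str
    text: str
    allowed_readers: set[str]
    deleted: bool = False


@dataclass
class SharedMemory:
    """Toy multi-principal memory pool with per-entry reader sets."""
    entries: list[MemoryEntry] = field(default_factory=list)

    def write(self, author: str, text: str, allowed_readers: set[str]) -> int:
        # The author can always read their own entry.
        entry = MemoryEntry(
            entry_id=len(self.entries),
            author=author,
            text=text,
            allowed_readers=set(allowed_readers) | {author},
        )
        self.entries.append(entry)
        return entry.entry_id

    def delete(self, requester: str, entry_id: int) -> bool:
        # Assumption: only the author may request deletion.
        entry = self.entries[entry_id]
        if entry.author != requester:
            return False
        entry.deleted = True
        entry.text = ""  # drop the content, keep only a tombstone
        return True

    def read(self, requester: str) -> list[str]:
        # Access control: return only live entries the requester may see.
        return [
            e.text
            for e in self.entries
            if not e.deleted and requester in e.allowed_readers
        ]


def leaks(response: str, leak_targets: list[str]) -> bool:
    """Crude substring check against annotated leak targets."""
    lowered = response.lower()
    return any(target.lower() in lowered for target in leak_targets)


if __name__ == "__main__":
    memory = SharedMemory()
    note_id = memory.write(
        "dr_kim", "Patient Lee is allergic to penicillin", {"nurse_park"}
    )
    memory.write(
        "nurse_park", "Ward 3 rounds moved to 9am", {"dr_kim", "visitor_choi"}
    )

    print(memory.read("nurse_park"))    # utility: authorized reader sees both
    print(memory.read("visitor_choi"))  # access control: medical note hidden
    memory.delete("dr_kim", note_id)
    print(memory.read("nurse_park"))    # forgetting: deleted note is gone
```

In this sketch a filter at read time is enough to enforce both access control and forgetting, because the store is the only path to the data. The failure mode reported above is different in kind: an agent can still surface unauthorized or deleted content that has already reached its context, summaries, or retrieval index, so leakage has to be checked on the agent's responses rather than on the store alone.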