GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

Abstract

Memory benchmarks for LLM agents largely assume single-user settings, leaving shared assistants for hospitals, workplaces, campuses, and households understudied. In these deployments, multiple principals write to a common memory pool and query it under different roles, scopes, and relationships, so memory quality requires governance as well as recall. We introduce GateMem, a benchmark for multi-principal shared-memory agents. GateMem jointly evaluates three properties: utility on legitimate long-horizon requests with state updates, access control across contextual authorization boundaries, and agent-facing active forgetting after explicit deletion requests. It spans medical, office, education, and household domains, with long-form multi-party episodes, incremental memory injection, hidden checkpoints, structured judging, and leak-target annotations. Across diverse baselines and backbone models, no method simultaneously achieves strong utility, robust access control, and reliable forgetting. Long-context prompting often yields the best governance score, but at high token cost; retrieval-based and external-memory methods reduce cost yet still leak unauthorized or deleted information. These results show that current memory agents remain far from ready for reliable deployment in shared institutional settings.