

MEMTRACK: Evaluating Long-Term Memory and State Tracking in Multi-Platform Dynamic Agent Environments

October 1, 2025
作者: Darshan Deshpande, Varun Gangal, Hersh Mehta, Anand Kannappan, Rebecca Qian, Peng Wang
cs.AI

Abstract

Recent work on context and memory benchmarking has primarily focused on conversational settings, but evaluating memory in dynamic enterprise environments is crucial for its effective application. We introduce MEMTRACK, a benchmark designed to evaluate long-term memory and state tracking in multi-platform agent environments. MEMTRACK models realistic organizational workflows by integrating asynchronous events across multiple communication and productivity platforms such as Slack, Linear, and Git. Each benchmark instance provides a chronologically platform-interleaved timeline with noisy, conflicting, cross-referring information, as well as potential codebase/file-system comprehension and exploration. Consequently, our benchmark tests memory capabilities such as acquisition, selection, and conflict resolution. We curate the MEMTRACK dataset through both manual expert-driven design and scalable agent-based synthesis, generating ecologically valid scenarios grounded in real-world software development processes. We introduce pertinent metrics for Correctness, Efficiency, and Redundancy that capture the effectiveness of memory mechanisms beyond simple QA performance. Experiments across SoTA LLMs and memory backends reveal challenges in utilizing memory across long horizons, handling cross-platform dependencies, and resolving contradictions. Notably, the best-performing GPT-5 model achieves only a 60% Correctness score on MEMTRACK. This work provides an extensible framework for advancing evaluation research for memory-augmented agents beyond the existing focus on conversational setups, and sets the stage for multi-agent, multi-platform memory benchmarking in complex organizational settings.
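The abstract describes instances as chronologically platform-interleaved event timelines probed with QA and scored by a Correctness metric. The following is a minimal illustrative sketch of that setup; all field names, the event schema, and the exact-match scoring are assumptions for illustration, not the benchmark's actual format.

```python
# Hypothetical sketch of a MEMTRACK-style instance: per-platform event
# streams merged into one chronological timeline, plus a simple
# exact-match Correctness score. Schema and scoring are illustrative
# assumptions, not the paper's actual implementation.
from dataclasses import dataclass

@dataclass
class Event:
    timestamp: int   # abstract clock tick
    platform: str    # e.g. "slack", "linear", "git"
    content: str

def build_timeline(*streams):
    """Merge per-platform event streams into one chronological timeline."""
    merged = [event for stream in streams for event in stream]
    return sorted(merged, key=lambda event: event.timestamp)

def correctness(predictions, gold):
    """Fraction of QA probes answered correctly (simple exact match)."""
    assert len(predictions) == len(gold)
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# Toy streams with a cross-reference (ENG-42) and a conflicting update.
slack = [Event(1, "slack", "Deadline moved to Friday"),
         Event(4, "slack", "Deadline moved back to Thursday")]
linear = [Event(2, "linear", "Ticket ENG-42 assigned to Ada")]
git = [Event(3, "git", "commit: fix auth bug (refs ENG-42)")]

timeline = build_timeline(slack, linear, git)
print([e.platform for e in timeline])  # chronological interleaving
print(correctness(["Thursday"], ["Thursday"]))  # conflict resolved correctly
```

Answering the deadline probe correctly requires preferring the later Slack update over the earlier one, which is the kind of conflict resolution the benchmark targets.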