MEMTRACK: Evaluating Long-Term Memory and State Tracking in Multi-Platform Dynamic Agent Environments
October 1, 2025
Authors: Darshan Deshpande, Varun Gangal, Hersh Mehta, Anand Kannappan, Rebecca Qian, Peng Wang
cs.AI
Abstract
Recent work on context and memory benchmarking has primarily focused on conversational instances, yet evaluating memory in dynamic enterprise environments is crucial for its effective application. We introduce MEMTRACK, a benchmark designed to evaluate long-term memory and state tracking in multi-platform agent environments. MEMTRACK models realistic organizational workflows by integrating asynchronous events across multiple communication and productivity platforms such as Slack, Linear, and Git. Each benchmark instance provides a chronologically platform-interleaved timeline with noisy, conflicting, cross-referring information, as well as potential codebase/file-system comprehension and exploration. Consequently, our benchmark tests memory capabilities such as acquisition, selection, and conflict resolution. We curate the MEMTRACK dataset through both manual expert-driven design and scalable agent-based synthesis, generating ecologically valid scenarios grounded in real-world software development processes. We introduce pertinent metrics for Correctness, Efficiency, and Redundancy that capture the effectiveness of memory mechanisms beyond simple QA performance. Experiments across SoTA LLMs and memory backends reveal challenges in utilizing memory across long horizons, handling cross-platform dependencies, and resolving contradictions. Notably, the best-performing GPT-5 model achieves only a 60% Correctness score on MEMTRACK. This work provides an extensible framework for advancing evaluation research for memory-augmented agents beyond the existing focus on conversational setups, and sets the stage for multi-agent, multi-platform memory benchmarking in complex organizational settings.