MEMTRACK: Evaluating Long-Term Memory and State Tracking in Multi-Platform Dynamic Agent Environments

Abstract

Recent work on context and memory benchmarking has focused primarily on conversational settings, yet evaluating memory in dynamic enterprise environments is crucial for its effective application. We introduce MEMTRACK, a benchmark designed to evaluate long-term memory and state tracking in multi-platform agent environments. MEMTRACK models realistic organizational workflows by integrating asynchronous events across multiple communication and productivity platforms such as Slack, Linear, and Git. Each benchmark instance provides a chronological timeline interleaved across platforms, containing noisy, conflicting, and cross-referring information, and may additionally require codebase and file-system comprehension and exploration. Consequently, our benchmark tests memory capabilities such as acquisition, selection, and conflict resolution. We curate the MEMTRACK dataset through both manual expert-driven design and scalable agent-based synthesis, generating ecologically valid scenarios grounded in real-world software development processes. We introduce pertinent metrics for Correctness, Efficiency, and Redundancy that capture the effectiveness of memory mechanisms beyond simple QA performance. Experiments across SoTA LLMs and memory backends reveal challenges in utilizing memory across long horizons, handling cross-platform dependencies, and resolving contradictions. Notably, the best-performing GPT-5 model achieves only a 60% Correctness score on MEMTRACK. This work provides an extensible framework for advancing evaluation research on memory-augmented agents beyond the existing focus on conversational setups, and sets the stage for multi-agent, multi-platform memory benchmarking in complex organizational settings.
