WorldMemArena: Evaluating Multimodal Agent Memory through Action-World Interaction

Abstract

Multimodal large language models are increasingly deployed as long-horizon agents, where memory must do more than recall: it must track an evolving world, revise what has gone stale, and surface the right evidence at decision time. Existing benchmarks measure recall over static dialogue, collapse memory into a single end-of-task accuracy, and reduce visual observations to captions, leaving us unable to localize failures to writing, maintenance, retrieval, or use. The rise of agent harnesses that author their own memory sharpens this gap, since we have no principled way to compare hand-designed pipelines with self-managing alternatives. To close these gaps, we formulate multimodal agent memory as an Action-World Interaction Loop with an observable four-stage lifecycle, and instantiate it in WorldMemArena: 400 multi-session multimodal tasks spanning Lifelong Evolution (evolving personal and task states) and Agentic Execution (memory from real observations, actions, and feedback), annotated with gold memory points, updates, distractors, and evidence chains for stage-level diagnosis. This enables the first head-to-head comparison of long-context, manually designed (RAG and external memory systems), and harness-based memory agents. Results show that: (1) better memory writing and storage do not guarantee better performance; (2) multimodal memory still struggles to fully use visual evidence; (3) systems are unstable across domains and degrade on realistic agentic trajectories; and (4) harness memory is more flexible but remains costly and less reliable.