WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction

Abstract

Multimodal large language models are increasingly deployed as long-horizon agents, where memory must do more than recall: it must track an evolving world, revise what has gone stale, and surface the right evidence at decision time. Existing benchmarks measure recall over static dialogue, collapse memory into a single end-of-task accuracy, and reduce visual observations to captions, leaving us unable to localize failures to writing, maintenance, retrieval, or use. The rise of agent harnesses that author their own memory sharpens this gap, since we have no principled way to compare hand-designed pipelines with self-managing alternatives. To close these gaps, we formulate multimodal agent memory as an Action-World Interaction Loop with an observable four-stage lifecycle, and instantiate it in WorldMemArena: 400 multi-session multimodal tasks spanning Lifelong Evolution (evolving personal and task states) and Agentic Execution (memory from real observations, actions, and feedback), annotated with gold memory points, updates, distractors, and evidence chains for stage-level diagnosis. This enables the first head-to-head comparison of long-context, manually designed (RAG and external memory systems), and harness-based memory agents. Results show that: (1) better memory writing and storage do not guarantee better performance; (2) multimodal memory still struggles to fully use visual evidence; (3) systems are unstable across domains and degrade on realistic agentic trajectories; and (4) harness memory is more flexible but remains costly and less reliable.