WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction
May 28, 2026
Authors: Chengzhi Liu, Yuzhe Yang, Sophia Xiao Pu, Yepeng Liu, Lin Long, Yichen Guo, Nuo Chen, Zhaotian Weng, Elena Kochkina, Simerjot Kaur, Charese Smiley, Xiaomo Liu, James Zou, Sheng Liu, Yuheng Bu, Songyou Peng, Xin Eric Wang
cs.AI
Abstract
Multimodal large language models are increasingly deployed as long-horizon agents, where memory must do more than recall: it must track an evolving world, revise what has gone stale, and surface the right evidence at decision time. Existing benchmarks measure recall over static dialogue, collapse memory into a single end-of-task accuracy, and reduce visual observations to captions, leaving us unable to localize failures to writing, maintenance, retrieval, or use. The rise of agent harnesses that author their own memory sharpens this gap, since we have no principled way to compare hand-designed pipelines with self-managing alternatives. To close these gaps, we formulate multimodal agent memory as an Action-World Interaction Loop with an observable four-stage lifecycle, and instantiate it in WorldMemArena: 400 multi-session multimodal tasks spanning Lifelong Evolution (evolving personal and task states) and Agentic Execution (memory from real observations, actions, and feedback), annotated with gold memory points, updates, distractors, and evidence chains for stage-level diagnosis. This enables the first head-to-head comparison of long-context, manually designed (RAG and external memory systems), and harness-based memory agents. Results show that: (1) better memory writing and storage do not guarantee better performance; (2) multimodal memory still struggles to fully use visual evidence; (3) systems are unstable across domains and degrade on realistic agentic trajectories; and (4) harness memory is more flexible but remains costly and less reliable.
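To make the idea of stage-level diagnosis concrete, the following is a minimal sketch of how a single failure might be attributed to one of the four lifecycle stages named above (writing, maintenance, retrieval, use). It is an illustration only, not the benchmark's scoring code: the `StageTrace` record, its field names, and the `first_failing_stage` function are all hypothetical, and the set-containment checks are assumed stand-ins for whatever matching the annotations actually support.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class StageTrace:
    """Hypothetical per-task record; field names are illustrative, not from the paper."""
    gold_points: Set[str]    # memory points the task needs (assumed to be plain IDs)
    stale_points: Set[str]   # points that a later update has superseded
    stored: Set[str]         # what the memory system holds after all sessions
    retrieved: Set[str]      # what the system surfaced at decision time
    answer_correct: bool     # whether the final answer was judged correct


def first_failing_stage(trace: StageTrace) -> Optional[str]:
    """Return the earliest lifecycle stage that failed, or None if all passed."""
    # Writing: every gold memory point should have been stored.
    if not trace.gold_points <= trace.stored:
        return "write"
    # Maintenance: superseded information should no longer be stored.
    if trace.stored & trace.stale_points:
        return "maintain"
    # Retrieval: every gold memory point should be surfaced for the question.
    if not trace.gold_points <= trace.retrieved:
        return "retrieve"
    # Use: with the right evidence in hand, the answer should be correct.
    if not trace.answer_correct:
        return "use"
    return None


if __name__ == "__main__":
    trace = StageTrace(
        gold_points={"p1", "p2"},
        stale_points={"p0"},
        stored={"p1", "p2"},
        retrieved={"p1"},
        answer_correct=False,
    )
    print(first_failing_stage(trace))
```

Under these assumptions, a system can store everything correctly and still fail at retrieval or use, which is the kind of pattern a single end-of-task accuracy would hide.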