WorldMemArena: Evaluating Multimodal Agent Memory through Action-World Interaction

Abstract

Multimodal large language models are increasingly deployed as long-horizon agents, where memory must do more than recall: it must track an evolving world, revise what has gone stale, and surface the right evidence at decision time. Existing benchmarks measure recall over static dialogue, collapse memory into a single end-of-task accuracy, and reduce visual observations to captions, leaving us unable to localize failures to writing, maintenance, retrieval, or use. The rise of agent harnesses that author their own memory sharpens this gap, since we have no principled way to compare hand-designed pipelines with self-managing alternatives. To close these gaps, we formulate multimodal agent memory as an Action-World Interaction Loop with an observable four-stage lifecycle, and instantiate it in WorldMemArena: 400 multi-session multimodal tasks spanning Lifelong Evolution (evolving personal and task states) and Agentic Execution (memory from real observations, actions, and feedback), annotated with gold memory points, updates, distractors, and evidence chains for stage-level diagnosis. This enables the first head-to-head comparison of long-context, manually designed (RAG and external memory systems), and harness-based memory agents. Results show that: (1) better memory writing and storage do not guarantee better performance; (2) multimodal memory still struggles to fully use visual evidence; (3) systems are unstable across domains and degrade on realistic agentic trajectories; and (4) harness memory is more flexible but remains costly and less reliable.
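As a rough illustration of what stage-level diagnosis could look like, the sketch below attributes a failed episode to the earliest of the four lifecycle stages (writing, maintenance, retrieval, use) at which something went wrong. It is not the benchmark's actual schema or metric: the `Episode` record, its field names, and the `localize_failure` rule are all hypothetical, chosen only to show how annotations such as evidence chains and updates might make failures localizable.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical record for one task; every field name here is an
# illustrative assumption, not the benchmark's real annotation format.
@dataclass
class Episode:
    evidence_chain: set[str]   # gold memory points needed for the final answer
    stale_points: set[str]     # points superseded by a later update
    written: set[str]          # points the agent stored at some time
    stored_at_query: set[str]  # points still in memory at decision time
    retrieved: set[str]        # points surfaced for the decision
    answer_correct: bool       # end-of-task correctness


def localize_failure(ep: Episode) -> str:
    """Return the earliest lifecycle stage that failed, or 'ok'."""
    needed = ep.evidence_chain
    if not needed <= ep.written:
        return "write"      # required evidence was never recorded
    if not needed <= ep.stored_at_query or ep.stale_points & ep.stored_at_query:
        return "maintain"   # evidence was lost, or stale facts were kept
    if not needed <= ep.retrieved:
        return "retrieve"   # evidence was stored but not surfaced
    if not ep.answer_correct:
        return "use"        # evidence was surfaced but the answer is wrong
    return "ok"


def stage_breakdown(episodes: list[Episode]) -> Counter:
    """Count outcomes per stage across a set of episodes."""
    return Counter(localize_failure(ep) for ep in episodes)


# Usage with two made-up episodes.
kept_stale = Episode(
    evidence_chain={"addr_new"}, stale_points={"addr_old"},
    written={"addr_old", "addr_new"},
    stored_at_query={"addr_old", "addr_new"},
    retrieved={"addr_new"}, answer_correct=True,
)
misused = Episode(
    evidence_chain={"door_color"}, stale_points=set(),
    written={"door_color"}, stored_at_query={"door_color"},
    retrieved={"door_color"}, answer_correct=False,
)
print(stage_breakdown([kept_stale, misused]))
```

Under this toy rule, a single end-of-task accuracy would score the first episode as a success even though an outdated fact was never retired, which is the kind of hidden failure a per-stage breakdown is meant to expose.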