Task-Focused Memory for Multimodal Agents

Abstract

Long-term memory is essential for multimodal agents to build coherent experience, accumulate world knowledge, and learn continually. However, constructing effective memory goes beyond memory module design and basic requirements such as accuracy and fidelity; the key challenge is deciding what to memorize. Multimodal agents, such as embodied agents, continuously perceive, reason, and act in real or virtual environments, receiving an unbounded stream of multimodal observations. From this combinatorial explosion of information, an agent must selectively retain content that is relevant to its role in the environment and valuable for future tasks. To address this challenge, we frame memory generation as a learnable memorization policy and introduce TaskMem (Task-focused Memorization Policy Learning), a reinforcement-learning-based framework that lets the policy dynamically adjust its focus to the demands of the real tasks encountered in the environment. TaskMem adopts a two-phase training paradigm. Phase One learns how to memorize by optimizing memory quality under fundamental fidelity requirements. Phase Two occurs after deployment: the agent learns what to memorize by tuning an adapter on its base multimodal large language model (MLLM), using recent environment tasks to define a reward model that guides the memorization policy toward task-relevant content. To evaluate our approach, we reformulate VideoMME, EgoLife, and EgoTempo into streaming benchmarks that simulate a realistic setting in which an agent processes streaming observations and handles tasks that arrive online. To isolate memory assessment, questions must be answered using only the agent's memory, without access to the raw video. Built on Qwen3-VL-30B-A3B, TaskMem improves VQA accuracy by 6.3%, 7.0%, and 5.3% on these benchmarks, respectively.
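
The memory-only streaming evaluation described above can be illustrated with a minimal sketch. This is not the TaskMem implementation: the names `Question`, `evaluate_streaming`, `keep_last`, `keep_task_relevant`, and `lookup_answer` are hypothetical, plain strings stand in for multimodal observations, the hand-written policies stand in for a learned memorization policy, and exact-match scoring is an assumption. The sketch only shows the protocol in which observations stream in, questions arrive online, and the answerer sees the memory but never the raw stream.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Question:
    arrival_step: int  # index of the last chunk observed before the question arrives
    text: str
    answer: str


def evaluate_streaming(
    chunks: List[str],
    questions: List[Question],
    memorize: Callable[[List[str], str], List[str]],
    answer: Callable[[List[str], str], str],
) -> float:
    """Accuracy when every question is answered from memory only."""
    memory: List[str] = []
    pending = sorted(questions, key=lambda q: q.arrival_step)
    correct, i = 0, 0
    for step, chunk in enumerate(chunks):
        # The memorization policy decides what to keep from the new chunk.
        memory = memorize(memory, chunk)
        # Answer the questions that have arrived by this step.
        while i < len(pending) and pending[i].arrival_step <= step:
            q = pending[i]
            # The answerer receives memory and question text, never the chunks.
            correct += int(answer(memory, q.text) == q.answer)
            i += 1
    # Questions scheduled after the stream ends use the final memory.
    for q in pending[i:]:
        correct += int(answer(memory, q.text) == q.answer)
    return correct / len(pending) if pending else 0.0


def keep_last(memory: List[str], chunk: str, budget: int = 2) -> List[str]:
    """Task-agnostic toy policy: keep only the most recent `budget` chunks."""
    return (memory + [chunk])[-budget:]


def keep_task_relevant(memory: List[str], chunk: str,
                       keywords=("keys",), budget: int = 2) -> List[str]:
    """Task-focused toy policy: keep only chunks that mention a task keyword."""
    if any(k in chunk for k in keywords):
        return (memory + [chunk])[-budget:]
    return memory


def lookup_answer(memory: List[str], question: str) -> str:
    """Toy answerer: last word of the first memory entry mentioning the question."""
    for entry in memory:
        if question in entry:
            return entry.split()[-1]
    return "unknown"


chunks = ["keys on table", "cat sleeps", "dog barks", "rain starts"]
questions = [Question(arrival_step=3, text="keys", answer="table")]

print(evaluate_streaming(chunks, questions, keep_last, lookup_answer))           # 0.0
print(evaluate_streaming(chunks, questions, keep_task_relevant, lookup_answer))  # 1.0
```

Under the same memory budget, the task-agnostic policy has already evicted the relevant observation by the time the question arrives, while the task-focused policy still holds it. This toy contrast mirrors the abstract's argument that the key challenge is what to memorize.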