Task-Focused Memory for Multimodal Agents

Abstract

Long-term memory is essential for multimodal agents to build coherent experience, accumulate world knowledge, and achieve continual learning. However, constructing effective memory goes beyond memory module design and basic requirements such as accuracy and fidelity; the key challenge lies in determining what to memorize. Multimodal agents, such as embodied agents, continuously perceive, reason, and act in real or virtual environments, receiving an unbounded stream of multimodal observations. From this combinatorial explosion of information, an agent must selectively retain content that is relevant to its role in the environment and valuable for future tasks. To bridge this gap, we frame memory generation as a learnable memorization policy and introduce TaskMem (Task-focused Memorization Policy Learning), a reinforcement-learning-based framework that enables the policy to dynamically adjust its focus to the demands of real tasks encountered in the environment. TaskMem adopts a two-phase training paradigm: Phase One learns how to memorize by optimizing memory quality under fundamental fidelity requirements; Phase Two occurs after deployment, where the agent learns what to memorize by tuning an adapter on its base MLLM, using recent environment tasks to define a reward model that guides the memorization policy toward task-relevant content. To evaluate our approach, we reformulate VideoMME, EgoLife, and EgoTempo into streaming benchmarks that simulate a realistic setting in which an agent processes streaming observations and handles tasks arriving online. To isolate memory assessment, the questions must be answered using only the agent's memory, without access to raw video. Built on Qwen3-VL-30B-A3B, TaskMem improves VQA accuracy by 6.3%, 7.0%, and 5.3% on these benchmarks, respectively.
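The memory-only streaming evaluation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Observation`, `Question`, `run_streaming_eval`, `toy_answerer`, and the three toy policies are hypothetical, text strings stand in for video segments, and a keyword lookup stands in for the MLLM that answers from memory. The sketch only shows the protocol: observations and questions arrive interleaved in time, the memorization policy decides what to store, and each question is answered from the stored memory alone.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Union


@dataclass
class Observation:
    t: float
    content: str  # hypothetical stand-in for a multimodal segment


@dataclass
class Question:
    t: float
    text: str
    answer: str


Event = Union[Observation, Question]
MemorizePolicy = Callable[[Observation], Optional[str]]
Answerer = Callable[[str, List[str]], str]


def run_streaming_eval(events: List[Event],
                       memorize: MemorizePolicy,
                       answer_from_memory: Answerer) -> float:
    """Replay events in time order; questions see memory only, never raw input."""
    memory: List[str] = []
    correct = total = 0
    for ev in sorted(events, key=lambda e: e.t):
        if isinstance(ev, Observation):
            entry = memorize(ev)  # the policy may decline to store anything
            if entry is not None:
                memory.append(entry)
        else:
            pred = answer_from_memory(ev.text, memory)
            correct += int(pred == ev.answer)
            total += 1
    return correct / total if total else 0.0


def toy_answerer(question: str, memory: List[str]) -> str:
    """Toy stand-in for an MLLM: find the queried object in memory."""
    target = question.rstrip("?").split()[-1]
    for entry in reversed(memory):
        if target in entry.split():
            return entry.split()[-1]  # toy convention: location is the last word
    return "unknown"


# Three toy memorization policies (assumptions for illustration only).
def keep_all(obs: Observation) -> Optional[str]:
    return obs.content


def keep_nothing(obs: Observation) -> Optional[str]:
    return None


def mug_only(obs: Observation) -> Optional[str]:
    return obs.content if "mug" in obs.content else None


events: List[Event] = [
    Observation(0, "the red mug is on the desk"),
    Observation(1, "a car passes outside"),
    Question(2, "where is the mug?", "desk"),
    Observation(3, "the keys are in the drawer"),
    Question(4, "where are the keys?", "drawer"),
]
```

In this toy setting, a policy that stores nothing answers no question correctly, while a policy focused on one kind of content answers only the questions that match its focus. That is the dependence between what is memorized and downstream task accuracy that a task-focused policy is meant to exploit.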