Task-Oriented Memory Mechanisms for Multimodal Agents

Abstract

Long-term memory is essential for multimodal agents to build coherent experience, accumulate world knowledge, and achieve continual learning. However, constructing effective memory goes beyond memory module design and basic requirements such as accuracy and fidelity; the key challenge lies in determining what to memorize. Multimodal agents, such as embodied agents, continuously perceive, reason, and act in real or virtual environments, and therefore receive an unbounded stream of multimodal observations. Faced with this combinatorial explosion of information, an agent must selectively retain content that is relevant to its role in the environment and valuable for future tasks. To address this challenge, we frame memory generation as a learnable memorization policy and introduce TaskMem (Task-focused Memorization Policy Learning), a reinforcement-learning-based framework that enables the policy to dynamically adjust its focus to the demands of the real tasks encountered in the environment. TaskMem adopts a two-phase training paradigm. Phase One learns how to memorize by optimizing memory quality under fundamental fidelity requirements. Phase Two takes place after deployment: the agent learns what to memorize by tuning an adapter on its base multimodal large language model (MLLM), using recent environment tasks to define a reward model that guides the memorization policy toward task-relevant content. To evaluate our approach, we reformulate VideoMME, EgoLife, and EgoTempo into streaming benchmarks that simulate a realistic setting in which an agent processes streaming observations and handles tasks that arrive online. To isolate the assessment of memory, questions must be answered using only the agent's memory, without access to the raw video. Built on Qwen3-VL-30B-A3B, TaskMem improves VQA accuracy by 6.3%, 7.0%, and 5.3% on these benchmarks, respectively.
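
The streaming, memory-only evaluation protocol described above can be illustrated with a minimal sketch. This is not the TaskMem implementation: the names `Observation`, `Question`, `evaluate_stream`, `memorize_all`, `keep_keys_only`, and `keyword_answer` are hypothetical, text strings stand in for multimodal observations, and the keyword-matching answerer is a toy placeholder for an MLLM. The only property the sketch aims to show is that the answerer sees the agent's memory as it stood when the question arrived, and never the raw observations.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Union


@dataclass
class Observation:
    time: float
    content: str  # toy stand-in for a video clip or other multimodal chunk


@dataclass
class Question:
    time: float
    text: str    # in this toy setup, a single query keyword
    answer: str  # ground-truth answer


Event = Union[Observation, Question]


def evaluate_stream(
    events: List[Event],
    memorize: Callable[[Observation], Optional[str]],
    answer: Callable[[str, List[str]], str],
) -> float:
    """Replay events in time order and return accuracy over the questions."""
    memory: List[str] = []
    correct = total = 0
    for ev in sorted(events, key=lambda e: e.time):
        if isinstance(ev, Observation):
            entry = memorize(ev)  # the policy decides what to keep; None discards
            if entry is not None:
                memory.append(entry)
        else:
            # The answerer receives a copy of memory only, never raw observations.
            pred = answer(ev.text, list(memory))
            correct += int(pred == ev.answer)
            total += 1
    return correct / total if total else 0.0


def memorize_all(obs: Observation) -> Optional[str]:
    """Hypothetical baseline policy: store every observation verbatim."""
    return obs.content


def keep_keys_only(obs: Observation) -> Optional[str]:
    """Hypothetical selective policy: keep only observations mentioning keys."""
    return obs.content if "keys" in obs.content else None


def keyword_answer(query: str, memory: List[str]) -> str:
    """Toy answerer: return the most recent memory entry containing the query."""
    for entry in reversed(memory):
        if query in entry:
            return entry
    return "unknown"


events: List[Event] = [
    Observation(0.0, "person puts keys on the kitchen table"),
    Observation(1.0, "tv plays an advertisement"),
    Question(2.0, "keys", "person puts keys on the kitchen table"),
    Observation(3.0, "person moves keys to the drawer"),
    Question(4.0, "advertisement", "tv plays an advertisement"),
]

acc_all = evaluate_stream(events, memorize_all, keyword_answer)
acc_selective = evaluate_stream(events, keep_keys_only, keyword_answer)
```

In this toy stream, a policy that discards the advertisement cannot answer the later question about it, which is the kind of mismatch between stored content and arriving tasks that a task-aware memorization policy is meant to reduce.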