

Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

August 13, 2025
Authors: Lin Long, Yichen He, Wentao Ye, Yiyuan Pan, Yuan Lin, Hang Li, Junbo Zhao, Wei Li
cs.AI

Abstract

We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update its long-term memory. Beyond episodic memory, it also develops semantic memory, enabling it to accumulate world knowledge over time. Its memory is organized in an entity-centric, multimodal format, allowing a deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn, iterative reasoning and retrieves relevant information from memory to accomplish the task. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a new long-video question-answering benchmark. M3-Bench comprises 100 newly recorded real-world videos captured from a robot's perspective (M3-Bench-robot) and 929 web-sourced videos across diverse scenarios (M3-Bench-web). We annotate question-answer pairs designed to test key capabilities essential for agent applications, such as human understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights into their practical design. Model, code, and data are available at https://github.com/bytedance-seed/m3-agent.
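To make the described design more concrete, below is a minimal illustrative sketch of an entity-centric memory holding episodic and semantic records, paired with a multi-turn retrieve-and-reason loop. This is not the authors' implementation: the class names (`EntityMemory`, `MemoryRecord`), the keyword-overlap retriever, and the fixed turn budget are all hypothetical simplifications; the paper's actual memory format, retrieval mechanism, and RL-trained reasoning policy are available in the repository linked above.

```python
from dataclasses import dataclass


@dataclass
class MemoryRecord:
    """One memory entry: an observed event (episodic) or a distilled fact (semantic)."""
    entity_id: str       # the person/object this record is about
    modality: str        # e.g. "visual" or "auditory"
    kind: str            # "episodic" (time-stamped event) or "semantic" (general fact)
    content: str         # textual summary of the memory
    timestamp: float = 0.0


class EntityMemory:
    """Entity-centric long-term memory: records are grouped by the entity they describe."""

    def __init__(self) -> None:
        self._store: dict[str, list[MemoryRecord]] = {}

    def add(self, record: MemoryRecord) -> None:
        self._store.setdefault(record.entity_id, []).append(record)

    def retrieve(self, query: str, top_k: int = 3) -> list[MemoryRecord]:
        # Toy retriever: rank records by keyword overlap with the query.
        # A real system would use multimodal embeddings instead.
        words = set(query.lower().split())
        scored = [
            (len(words & set(r.content.lower().split())), r)
            for records in self._store.values()
            for r in records
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [r for score, r in scored[:top_k] if score > 0]


def answer(memory: EntityMemory, question: str, max_turns: int = 3) -> str:
    """Multi-turn loop: retrieve, check sufficiency, refine the query, repeat."""
    query, evidence = question, []
    for _ in range(max_turns):
        hits = memory.retrieve(query)
        evidence.extend(h.content for h in hits)
        if hits:  # stand-in for an LLM judging the retrieved evidence sufficient
            break
        query = question + " person"  # stand-in for LLM-generated query refinement
    return " / ".join(evidence) if evidence else "no relevant memory"


if __name__ == "__main__":
    mem = EntityMemory()
    mem.add(MemoryRecord("person_1", "visual", "episodic",
                         "person_1 wore a red jacket in the kitchen", 12.0))
    mem.add(MemoryRecord("person_1", "auditory", "semantic",
                         "person_1 is allergic to peanuts"))
    print(answer(mem, "what is person_1 allergic to"))
```

The key design point the sketch mirrors is that memories are indexed by entity rather than by time alone, so episodic observations and semantic facts about the same person or object stay linked and can be retrieved together across modalities.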