Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory
August 13, 2025
Authors: Lin Long, Yichen He, Wentao Ye, Yiyuan Pan, Yuan Lin, Hang Li, Junbo Zhao, Wei Li
cs.AI
Abstract
We introduce M3-Agent, a novel multimodal agent framework equipped with
long-term memory. Like humans, M3-Agent can process real-time visual and
auditory inputs to build and update its long-term memory. Beyond episodic
memory, it also develops semantic memory, enabling it to accumulate world
knowledge over time. Its memory is organized in an entity-centric, multimodal
format, allowing deeper and more consistent understanding of the environment.
Given an instruction, M3-Agent autonomously performs multi-turn, iterative
reasoning and retrieves relevant information from memory to accomplish the
task. To evaluate memory effectiveness and memory-based reasoning in multimodal
agents, we develop M3-Bench, a new long-video question answering benchmark.
M3-Bench comprises 100 newly recorded real-world videos captured from a robot's
perspective (M3-Bench-robot) and 929 web-sourced videos across diverse
scenarios (M3-Bench-web). We annotate question-answer pairs designed to test
key capabilities essential for agent applications, such as human understanding,
general knowledge extraction, and cross-modal reasoning. Experimental results
show that M3-Agent, trained via reinforcement learning, outperforms the
strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o,
achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web,
and VideoMME-long, respectively. Our work advances multimodal agents toward
more human-like long-term memory and provides insights into their practical
design. Model, code, and data are available at
https://github.com/bytedance-seed/m3-agent
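
The entity-centric memory described above, with separate episodic and semantic entries retrieved on demand, can be sketched as a toy data structure. This is a hypothetical illustration, not the authors' implementation: entity names, the `MemoryStore` class, and the keyword-overlap retrieval are all assumptions standing in for the paper's learned components.

```python
from dataclasses import dataclass, field


@dataclass
class EntityMemory:
    """Memory node for one entity, holding both memory types."""
    entity_id: str
    episodic: list[str] = field(default_factory=list)   # time-ordered events
    semantic: list[str] = field(default_factory=list)   # accumulated world knowledge


class MemoryStore:
    """Toy long-term memory: entries are grouped per entity and
    retrieved by naive keyword overlap with the query (a stand-in
    for the learned retrieval the paper describes)."""

    def __init__(self) -> None:
        self.entities: dict[str, EntityMemory] = {}

    def add(self, entity_id: str, text: str, kind: str = "episodic") -> None:
        node = self.entities.setdefault(entity_id, EntityMemory(entity_id))
        getattr(node, kind).append(text)

    def retrieve(self, query: str, top_k: int = 2) -> list[str]:
        q = set(query.lower().split())
        scored = []
        for node in self.entities.values():
            for text in node.episodic + node.semantic:
                overlap = len(q & set(text.lower().split()))
                if overlap:
                    scored.append((overlap, text))
        scored.sort(key=lambda s: -s[0])
        return [text for _, text in scored[:top_k]]


# Hypothetical usage: episodic observations plus a semantic fact about "alice".
store = MemoryStore()
store.add("alice", "Alice picked up the red mug in the kitchen")
store.add("alice", "Alice prefers coffee over tea", kind="semantic")
store.add("robot", "The robot charged at the dock at 9am")
print(store.retrieve("what does Alice prefer, coffee or tea?"))
```

In an agent loop, a retrieval call like this would feed each round of the multi-turn reasoning the abstract describes, with the model deciding what to query next.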