Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory
August 13, 2025
Authors: Lin Long, Yichen He, Wentao Ye, Yiyuan Pan, Yuan Lin, Hang Li, Junbo Zhao, Wei Li
cs.AI
Abstract
We introduce M3-Agent, a novel multimodal agent framework equipped with
long-term memory. Like humans, M3-Agent can process real-time visual and
auditory inputs to build and update its long-term memory. Beyond episodic
memory, it also develops semantic memory, enabling it to accumulate world
knowledge over time. Its memory is organized in an entity-centric, multimodal
format, allowing deeper and more consistent understanding of the environment.
Given an instruction, M3-Agent autonomously performs multi-turn, iterative
reasoning and retrieves relevant information from memory to accomplish the
task. To evaluate memory effectiveness and memory-based reasoning in multimodal
agents, we develop M3-Bench, a new long-video question answering benchmark.
M3-Bench comprises 100 newly recorded real-world videos captured from a robot's
perspective (M3-Bench-robot) and 929 web-sourced videos across diverse
scenarios (M3-Bench-web). We annotate question-answer pairs designed to test
key capabilities essential for agent applications, such as human understanding,
general knowledge extraction, and cross-modal reasoning. Experimental results
show that M3-Agent, trained via reinforcement learning, outperforms the
strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o,
achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web,
and VideoMME-long, respectively. Our work advances multimodal agents toward
more human-like long-term memory and provides insights into their practical
design. Model, code, and data are available at
https://github.com/bytedance-seed/m3-agent
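
The entity-centric memory described above, with separate episodic and semantic entries retrieved on demand, can be sketched as a toy data structure. This is a hypothetical illustration, not the authors' implementation: entity names, the `MemoryStore` class, and the keyword-overlap retrieval are all assumptions standing in for the paper's learned components.

```python
from dataclasses import dataclass, field


@dataclass
class EntityMemory:
    """Memory node for one entity, holding both memory types."""
    entity_id: str
    episodic: list[str] = field(default_factory=list)   # time-ordered events
    semantic: list[str] = field(default_factory=list)   # accumulated world knowledge


class MemoryStore:
    """Toy long-term memory: entries are grouped per entity and
    retrieved by naive keyword overlap with the query (a stand-in
    for the learned retrieval the paper describes)."""

    def __init__(self) -> None:
        self.entities: dict[str, EntityMemory] = {}

    def add(self, entity_id: str, text: str, kind: str = "episodic") -> None:
        node = self.entities.setdefault(entity_id, EntityMemory(entity_id))
        getattr(node, kind).append(text)

    def retrieve(self, query: str, top_k: int = 2) -> list[str]:
        q = set(query.lower().split())
        scored = []
        for node in self.entities.values():
            for text in node.episodic + node.semantic:
                overlap = len(q & set(text.lower().split()))
                if overlap:
                    scored.append((overlap, text))
        scored.sort(key=lambda s: -s[0])
        return [text for _, text in scored[:top_k]]


# Hypothetical usage: episodic observations plus a semantic fact about "alice".
store = MemoryStore()
store.add("alice", "Alice picked up the red mug in the kitchen")
store.add("alice", "Alice prefers coffee over tea", kind="semantic")
store.add("robot", "The robot charged at the dock at 9am")
print(store.retrieve("what does Alice prefer, coffee or tea?"))
```

In an agent loop, a retrieval call like this would feed each round of the multi-turn reasoning the abstract describes, with the model deciding what to query next.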