

Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

August 13, 2025
Authors: Lin Long, Yichen He, Wentao Ye, Yiyuan Pan, Yuan Lin, Hang Li, Junbo Zhao, Wei Li
cs.AI

Abstract

We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update its long-term memory. Beyond episodic memory, it also develops semantic memory, enabling it to accumulate world knowledge over time. Its memory is organized in an entity-centric, multimodal format, allowing a deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn, iterative reasoning and retrieves relevant information from memory to accomplish the task. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a new long-video question-answering benchmark. M3-Bench comprises 100 newly recorded real-world videos captured from a robot's perspective (M3-Bench-robot) and 929 web-sourced videos across diverse scenarios (M3-Bench-web). We annotate question-answer pairs designed to test key capabilities essential for agent applications, such as human understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights into their practical design. Model, code, and data are available at https://github.com/bytedance-seed/m3-agent.
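To make the described design more concrete, below is a minimal illustrative sketch of an entity-centric memory holding episodic and semantic records, paired with a multi-turn retrieve-and-reason loop. This is not the authors' implementation: the class names (`EntityMemory`, `MemoryRecord`), the keyword-overlap retriever, and the fixed turn budget are all hypothetical simplifications; the paper's actual memory format, retrieval mechanism, and RL-trained reasoning policy are available in the repository linked above.

```python
from dataclasses import dataclass


@dataclass
class MemoryRecord:
    """One memory entry: an observed event (episodic) or a distilled fact (semantic)."""
    entity_id: str       # the person/object this record is about
    modality: str        # e.g. "visual" or "auditory"
    kind: str            # "episodic" (time-stamped event) or "semantic" (general fact)
    content: str         # textual summary of the memory
    timestamp: float = 0.0


class EntityMemory:
    """Entity-centric long-term memory: records are grouped by the entity they describe."""

    def __init__(self) -> None:
        self._store: dict[str, list[MemoryRecord]] = {}

    def add(self, record: MemoryRecord) -> None:
        self._store.setdefault(record.entity_id, []).append(record)

    def retrieve(self, query: str, top_k: int = 3) -> list[MemoryRecord]:
        # Toy retriever: rank records by keyword overlap with the query.
        # A real system would use multimodal embeddings instead.
        words = set(query.lower().split())
        scored = [
            (len(words & set(r.content.lower().split())), r)
            for records in self._store.values()
            for r in records
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [r for score, r in scored[:top_k] if score > 0]


def answer(memory: EntityMemory, question: str, max_turns: int = 3) -> str:
    """Multi-turn loop: retrieve, check sufficiency, refine the query, repeat."""
    query, evidence = question, []
    for _ in range(max_turns):
        hits = memory.retrieve(query)
        evidence.extend(h.content for h in hits)
        if hits:  # stand-in for an LLM judging the retrieved evidence sufficient
            break
        query = question + " person"  # stand-in for LLM-generated query refinement
    return " / ".join(evidence) if evidence else "no relevant memory"


if __name__ == "__main__":
    mem = EntityMemory()
    mem.add(MemoryRecord("person_1", "visual", "episodic",
                         "person_1 wore a red jacket in the kitchen", 12.0))
    mem.add(MemoryRecord("person_1", "auditory", "semantic",
                         "person_1 is allergic to peanuts"))
    print(answer(mem, "what is person_1 allergic to"))
```

The key design point the sketch mirrors is that memories are indexed by entity rather than by time alone, so episodic observations and semantic facts about the same person or object stay linked and can be retrieved together across modalities.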