Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

Abstract

We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update its long-term memory. Beyond episodic memory, it also develops semantic memory, enabling it to accumulate world knowledge over time. Its memory is organized in an entity-centric, multimodal format, allowing a deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn, iterative reasoning and retrieves relevant information from memory to accomplish the task. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a new long-video question answering benchmark. M3-Bench comprises 100 newly recorded real-world videos captured from a robot's perspective (M3-Bench-robot) and 929 web-sourced videos across diverse scenarios (M3-Bench-web). We annotate question-answer pairs designed to test key capabilities essential for agent applications, such as human understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights into their practical design. Model, code, and data are available at https://github.com/bytedance-seed/m3-agent
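
The abstract describes an entity-centric memory holding both episodic and semantic entries, queried over several reasoning turns. The following is a minimal illustrative sketch of that idea in Python, not the M3-Agent implementation: the names `MemoryEntry`, `EntityMemory`, `answer`, and `next_query` are hypothetical, the entries are text-only rather than multimodal, and the keyword-overlap search is a toy stand-in for whatever retrieval the actual system uses.

```python
import re
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    kind: str         # "episodic" (a specific observed event) or "semantic" (general knowledge)
    text: str         # natural-language content of the memory
    timestamp: float  # seconds into the input stream when it was recorded


@dataclass
class EntityMemory:
    """Hypothetical entity-centric store: entries are grouped by entity id."""
    entries: dict = field(default_factory=dict)  # entity_id -> list of MemoryEntry

    def add(self, entity_id, kind, text, timestamp):
        self.entries.setdefault(entity_id, []).append(MemoryEntry(kind, text, timestamp))

    def search(self, query):
        """Toy retrieval: return (entity_id, entry) pairs sharing a word with the query."""
        words = set(re.findall(r"\w+", query.lower()))
        hits = []
        for entity_id, items in self.entries.items():
            for entry in items:
                if words & set(re.findall(r"\w+", entry.text.lower())):
                    hits.append((entity_id, entry))
        return hits


def answer(memory, question, next_query, max_turns=3):
    """Multi-turn loop: retrieve, accumulate evidence, then ask a policy for the next query.

    next_query(question, evidence) is an assumed callback standing in for the
    reasoning model; it returns a new query string, or None to stop.
    """
    evidence = []
    query = question
    for _ in range(max_turns):
        for hit in memory.search(query):
            if hit not in evidence:
                evidence.append(hit)
        query = next_query(question, evidence)
        if query is None:
            break
    return evidence
```

Grouping entries by entity id is what lets a single lookup return both what an entity did (episodic) and what is generally known about it (semantic). In this sketch the `next_query` callback is the only place where a learned policy would plug in.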
