Task-Focused Memory for Multimodal Agents

Abstract

Long-term memory is essential for multimodal agents to build coherent experience, accumulate world knowledge, and learn continually. However, constructing effective memory involves more than designing the memory module and meeting basic requirements such as accuracy and fidelity; the key challenge is deciding what to memorize. Multimodal agents, such as embodied agents, continuously perceive, reason, and act in real or virtual environments, receiving an unbounded stream of multimodal observations. From this combinatorial explosion of information, an agent must selectively retain content that is relevant to its role in the environment and valuable for future tasks. To address this challenge, we frame memory generation as a learnable memorization policy and introduce TaskMem (Task-focused Memorization Policy Learning), a reinforcement-learning-based framework that lets the policy dynamically adjust its focus to the demands of the real tasks it encounters in the environment. TaskMem adopts a two-phase training paradigm. Phase One learns how to memorize by optimizing memory quality under fundamental fidelity requirements. Phase Two takes place after deployment: the agent learns what to memorize by tuning an adapter on its base MLLM, using recent environment tasks to define a reward model that guides the memorization policy toward task-relevant content. To evaluate our approach, we reformulate VideoMME, EgoLife, and EgoTempo as streaming benchmarks that simulate a realistic setting in which an agent processes streaming observations and handles tasks that arrive online. To isolate the contribution of memory, questions must be answered using only the agent's memory, without access to the raw video. Built on Qwen3-VL-30B-A3B, TaskMem improves VQA accuracy by 6.3%, 7.0%, and 5.3% on these benchmarks, respectively.
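
The memory-only streaming protocol described above can be illustrated with a minimal sketch. This is not the TaskMem implementation: the names `StreamingEpisode`, `evaluate_memory_only`, `keep_all`, `keep_nothing`, and `lookup_answer` are hypothetical, observations are plain strings standing in for video chunks, and the toy policies stand in for a learned memorization policy and an MLLM answerer. It only shows the control flow we assume: observations arrive in order, each question is answered at its arrival step, and the answerer sees the memory but never the raw stream.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Memory = List[str]


@dataclass
class StreamingEpisode:
    # Hypothetical container: strings stand in for streamed video chunks.
    observations: List[str]
    # Each question is (arrival_step, question, gold_answer).
    questions: List[Tuple[int, str, str]]


def evaluate_memory_only(
    episode: StreamingEpisode,
    memorize: Callable[[str, Memory], List[str]],
    answer: Callable[[str, Memory], Optional[str]],
) -> float:
    """Return accuracy when questions are answered from memory alone."""
    memory: Memory = []
    pending = sorted(episode.questions, key=lambda q: q[0])
    correct = 0
    next_q = 0
    for step, observation in enumerate(episode.observations):
        # The memorization policy decides what (if anything) to store.
        memory.extend(memorize(observation, memory))
        # Answer every question that has arrived by this step.
        while next_q < len(pending) and pending[next_q][0] <= step:
            _, question, gold = pending[next_q]
            # The answerer receives only the memory, never the observation.
            if answer(question, memory) == gold:
                correct += 1
            next_q += 1
    # Questions arriving after the stream ends are answered from final memory.
    for _, question, gold in pending[next_q:]:
        if answer(question, memory) == gold:
            correct += 1
    return correct / len(pending) if pending else 0.0


def keep_all(observation: str, memory: Memory) -> List[str]:
    # Toy policy: store every observation verbatim.
    return [observation]


def keep_nothing(observation: str, memory: Memory) -> List[str]:
    # Toy policy: store nothing, so no question can be answered.
    return []


def lookup_answer(question: str, memory: Memory) -> Optional[str]:
    # Toy answerer: entries look like "key:value"; return the latest match.
    for entry in reversed(memory):
        key, _, value = entry.partition(":")
        if key == question:
            return value
    return None


episode = StreamingEpisode(
    observations=["cup:red", "noise", "door:open"],
    questions=[(0, "cup", "red"), (1, "door", "open"), (2, "door", "open")],
)
accuracy_keep_all = evaluate_memory_only(episode, keep_all, lookup_answer)
accuracy_keep_nothing = evaluate_memory_only(episode, keep_nothing, lookup_answer)
```

In this toy episode, the question about the door that arrives at step 1 cannot be answered because the relevant observation has not yet streamed in, which is the kind of online constraint the reformulated benchmarks are meant to capture.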