Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline

Abstract

While datasets for video understanding have scaled to hour-long durations, they typically consist of densely concatenated clips that differ from natural, unscripted daily life. To bridge this gap, we introduce MM-Lifelong, a dataset designed for Multimodal Lifelong Understanding. Comprising 181.1 hours of footage, it is structured across Day, Week, and Month scales to capture varying temporal densities. Extensive evaluations reveal two critical failure modes in current paradigms: end-to-end MLLMs suffer from a Working Memory Bottleneck due to context saturation, while representative agentic baselines experience Global Localization Collapse when navigating sparse, month-long timelines. To address these failures, we propose the Recursive Multimodal Agent (ReMA), which employs dynamic memory management to iteratively update a recursive belief state, significantly outperforming existing methods. Finally, we establish dataset splits designed to isolate temporal and domain biases, providing a rigorous foundation for future research in supervised learning and out-of-distribution generalization.
