Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline

Abstract

Datasets for video understanding have scaled to hour-long durations, but they typically consist of densely concatenated clips that differ from natural, unscripted daily life. To bridge this gap, we introduce MM-Lifelong, a dataset designed for Multimodal Lifelong Understanding. Comprising 181.1 hours of footage, it is structured across Day, Week, and Month scales to capture varying temporal densities. Extensive evaluations reveal two critical failure modes in current paradigms: end-to-end multimodal large language models (MLLMs) suffer from a Working Memory Bottleneck caused by context saturation, while representative agentic baselines experience Global Localization Collapse when navigating sparse, month-long timelines. To address these failure modes, we propose the Recursive Multimodal Agent (ReMA), which uses dynamic memory management to iteratively update a recursive belief state and significantly outperforms existing methods. Finally, we establish dataset splits designed to isolate temporal and domain biases, providing a rigorous foundation for future research in supervised learning and out-of-distribution generalization.
