EgoAVU: Egocentric Audio-Visual Understanding
February 5, 2026
Authors: Ashish Seth, Xinhao Mei, Changsheng Zhao, Varun Nagaraja, Ernie Chang, Gregory P. Meyer, Gael Le Lan, Yunyang Xiong, Vikas Chandra, Yangyang Shi, Dinesh Manocha, Zhipeng Cai
cs.AI
Abstract
Understanding egocentric videos plays a vital role in embodied intelligence. Recent multi-modal large language models (MLLMs) can accept both visual and audio inputs. However, because text labels with coherent joint-modality information are difficult to obtain, whether MLLMs can jointly understand both modalities in egocentric videos remains under-explored. To address this problem, we introduce EgoAVU, a scalable data engine that automatically generates egocentric audio-visual narrations, questions, and answers. EgoAVU enriches human narrations with multimodal context and generates audio-visual narrations through cross-modal correlation modeling. Token-based video filtering and modular, graph-based curation ensure both data diversity and quality. Leveraging EgoAVU, we construct EgoAVU-Instruct, a large-scale training dataset of 3M samples, and EgoAVU-Bench, a manually verified evaluation split covering diverse tasks. EgoAVU-Bench clearly reveals the limitations of existing MLLMs: they are heavily biased toward visual signals, often neglecting audio cues or failing to associate sounds with their visual sources. Finetuning MLLMs on EgoAVU-Instruct effectively addresses this issue, improving performance on EgoAVU-Bench by up to 113%. These benefits also transfer to other benchmarks such as EgoTempo and EgoIllusion, with relative gains of up to 28%. Code will be released to the community.
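For readers who want a concrete picture of the pipeline, the sketch below walks through four of the data-engine stages the abstract names: narration enrichment, cross-modal correlation modeling, token-based filtering, and question-answer generation (graph-based curation is omitted for brevity). It is a minimal, purely illustrative sketch; every class and function here (Clip, enrich_narration, correlate_modalities, passes_token_filter, build_qa) is a hypothetical stand-in, not the authors' released code.

```python
# A minimal, purely illustrative sketch of the EgoAVU data-engine stages
# described in the abstract. All names below are hypothetical stand-ins,
# not the authors' actual API; graph-based curation is omitted for brevity.
from dataclasses import dataclass


@dataclass
class Clip:
    """An egocentric clip with placeholder modality tokens and a narration."""
    video_tokens: list[str]   # stand-in for detected visual objects/features
    audio_tokens: list[str]   # stand-in for detected sound events
    narration: str            # original human narration


def enrich_narration(clip: Clip) -> str:
    """Stage 1: enrich the human narration with multimodal context
    (a trivial stand-in for an MLLM-based rewriter)."""
    return f"{clip.narration} [audio context: {', '.join(clip.audio_tokens)}]"


def correlate_modalities(clip: Clip) -> list[tuple[str, str]]:
    """Stage 2: cross-modal correlation modeling; here, naively pair
    each sound event with a candidate visual source."""
    return list(zip(clip.audio_tokens, clip.video_tokens))


def passes_token_filter(clip: Clip, min_tokens: int = 2) -> bool:
    """Stage 3: token-based filtering to drop clips too sparse in
    either modality to yield a coherent audio-visual narration."""
    return min(len(clip.video_tokens), len(clip.audio_tokens)) >= min_tokens


def build_qa(narration: str, pairs: list[tuple[str, str]]) -> dict:
    """Stage 4: turn the enriched narration and correlations into a
    question-answer sample probing audio-visual grounding."""
    sound, source = pairs[0]
    return {
        "context": narration,
        "question": f"Which object is the source of the '{sound}' sound?",
        "answer": source,
    }


clip = Clip(
    video_tokens=["kettle", "stove"],
    audio_tokens=["whistling", "clicking"],
    narration="The camera wearer boils water in the kitchen.",
)
if passes_token_filter(clip):
    print(build_qa(enrich_narration(clip), correlate_modalities(clip)))
```

The toy example already shows the kind of supervision such a pipeline targets: questions whose answers require linking a sound to its visual source, which is exactly the audio-visual grounding the abstract reports existing MLLMs failing at.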