EgoAVU: Egocentric Audio-Visual Understanding

Abstract

Understanding egocentric videos plays a vital role in embodied intelligence. Recent multimodal large language models (MLLMs) can accept both visual and audio inputs. However, because text labels with coherent joint-modality information are difficult to obtain, whether MLLMs can jointly understand both modalities in egocentric videos remains under-explored. To address this problem, we introduce EgoAVU, a scalable data engine that automatically generates egocentric audio-visual narrations, questions, and answers. EgoAVU enriches human narrations with multimodal context and generates audio-visual narrations through cross-modal correlation modeling. Token-based video filtering and modular, graph-based curation ensure both data diversity and quality. Leveraging EgoAVU, we construct EgoAVU-Instruct, a large-scale training dataset of 3M samples, and EgoAVU-Bench, a manually verified evaluation split covering diverse tasks. EgoAVU-Bench clearly reveals the limitations of existing MLLMs: they are heavily biased toward visual signals, often neglecting audio cues or failing to associate audio with its visual source. Finetuning MLLMs on EgoAVU-Instruct effectively addresses this issue, yielding up to a 113% performance improvement on EgoAVU-Bench. These benefits also transfer to other benchmarks such as EgoTempo and EgoIllusion, with relative performance gains of up to 28%. Code will be released to the community.
