EgoAVU: Egocentric Audio-Visual Understanding

Abstract

Understanding egocentric videos is vital for embodied intelligence. Recent multimodal large language models (MLLMs) can accept both visual and audio inputs. However, because text labels with coherent joint-modality information are hard to obtain, whether MLLMs can jointly understand both modalities in egocentric videos remains under-explored. To address this problem, we introduce EgoAVU, a scalable data engine that automatically generates egocentric audio-visual narrations, questions, and answers. EgoAVU enriches human narrations with multimodal context and generates audio-visual narrations through cross-modal correlation modeling. Token-based video filtering and modular, graph-based curation ensure both data diversity and quality. Leveraging EgoAVU, we construct EgoAVU-Instruct, a large-scale training dataset of 3M samples, and EgoAVU-Bench, a manually verified evaluation split covering diverse tasks. EgoAVU-Bench reveals the limitations of existing MLLMs: they are heavily biased toward visual signals, often neglecting audio cues or failing to associate a sound with its visual source. Finetuning MLLMs on EgoAVU-Instruct effectively addresses this issue, yielding up to a 113% performance improvement on EgoAVU-Bench. These benefits also transfer to other benchmarks such as EgoTempo and EgoIllusion, with up to a 28% relative performance gain. Code will be released to the community.
