EgoAVU: Egocentric Audio-Visual Understanding

Abstract

Understanding egocentric videos plays a vital role in embodied intelligence. Recent multi-modal large language models (MLLMs) can accept both visual and audio inputs. However, because text labels with coherent joint-modality information are difficult to obtain, whether MLLMs can jointly understand both modalities in egocentric videos remains under-explored. To address this problem, we introduce EgoAVU, a scalable data engine that automatically generates egocentric audio-visual narrations, questions, and answers. EgoAVU enriches human narrations with multimodal context and generates audio-visual narrations through cross-modal correlation modeling. Token-based video filtering and modular, graph-based curation ensure both data diversity and quality. Leveraging EgoAVU, we construct EgoAVU-Instruct, a large-scale training dataset of 3M samples, and EgoAVU-Bench, a manually verified evaluation split covering diverse tasks. EgoAVU-Bench clearly reveals the limitations of existing MLLMs: they are heavily biased toward visual signals, often neglecting audio cues or failing to associate audio with its visual source. Finetuning MLLMs on EgoAVU-Instruct effectively addresses this issue, yielding up to a 113% performance improvement on EgoAVU-Bench. These benefits also transfer to other benchmarks such as EgoTempo and EgoIllusion, with up to a 28% relative performance gain. Code will be released to the community.
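A gain above 100% can look surprising, so the short sketch below illustrates the arithmetic of a relative improvement. It assumes the 113% figure is a relative gain over the baseline score (the abstract uses the word "relative" only for the 28% figure), and the scores used are hypothetical values chosen for illustration, not results from the paper.

```python
def relative_gain(baseline: float, finetuned: float) -> float:
    """Return the improvement of `finetuned` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (finetuned - baseline) / baseline * 100.0


# Hypothetical benchmark scores (NOT taken from the paper).
baseline_score = 20.0
finetuned_score = 42.6

gain = relative_gain(baseline_score, finetuned_score)
print(f"{gain:.0f}% relative gain")
```

Under this reading, a relative gain above 100% means the finetuned score is more than double the baseline score.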
