Audio-Visual Intelligence in Large Foundation Models

Abstract

Audio-Visual Intelligence (AVI) has emerged as a central frontier in artificial intelligence, bridging auditory and visual modalities to enable machines that can perceive, generate, and interact in the multimodal real world. In the era of large foundation models, joint modeling of audio and vision has become increasingly crucial, not only for understanding but also for controllable generation and reasoning over dynamic, temporally grounded signals. Recent advances, such as Meta MovieGen and Google Veo-3, highlight the growing industrial and academic focus on unified audio-vision architectures that learn from massive multimodal data. Despite this rapid progress, the literature remains fragmented: it spans diverse tasks, inconsistent taxonomies, and heterogeneous evaluation practices, all of which impede systematic comparison and knowledge integration. This survey provides the first comprehensive review of AVI through the lens of large foundation models. We establish a unified taxonomy covering the broad landscape of AVI tasks, ranging from understanding (e.g., speech recognition, sound localization) to generation (e.g., audio-driven video synthesis, video-to-audio) and interaction (e.g., dialogue, embodied, or agentic interfaces). We synthesize methodological foundations, including modality tokenization, cross-modal fusion, autoregressive and diffusion-based generation, large-scale pretraining, instruction alignment, and preference optimization. Furthermore, we curate representative datasets, benchmarks, and evaluation metrics, offering a structured comparison across task families and identifying open challenges in synchronization, spatial reasoning, controllability, and safety. By consolidating this rapidly expanding field into a coherent framework, this survey aims to serve as a foundational reference for future research on large-scale AVI.
