Audio-Visual Intelligence in Large Foundation Models

Abstract

Audio-Visual Intelligence (AVI) has emerged as a central frontier in artificial intelligence, bridging auditory and visual modalities to enable machines that can perceive, generate, and interact in the multimodal real world. In the era of large foundation models, joint modeling of audio and vision has become increasingly crucial, not only for understanding but also for controllable generation and reasoning over dynamic, temporally grounded signals. Recent advances, such as Meta MovieGen and Google Veo-3, highlight the growing industrial and academic focus on unified audio-visual architectures that learn from massive multimodal data. Despite this rapid progress, the literature remains fragmented across diverse tasks, inconsistent taxonomies, and heterogeneous evaluation practices, which impedes systematic comparison and knowledge integration. This survey provides the first comprehensive review of AVI through the lens of large foundation models. We establish a unified taxonomy covering the broad landscape of AVI tasks, ranging from understanding (e.g., speech recognition, sound localization) to generation (e.g., audio-driven video synthesis, video-to-audio generation) and interaction (e.g., dialogue-based, embodied, or agentic interfaces). We synthesize the methodological foundations, including modality tokenization, cross-modal fusion, autoregressive and diffusion-based generation, large-scale pretraining, instruction alignment, and preference optimization. Furthermore, we curate representative datasets, benchmarks, and evaluation metrics, offering a structured comparison across task families and identifying open challenges in synchronization, spatial reasoning, controllability, and safety. By consolidating this rapidly expanding field into a coherent framework, this survey aims to serve as a foundational reference for future research on large-scale AVI.
