
Audio-Visual Intelligence in Large Foundation Models

May 5, 2026
Authors: You Qin, Kai Liu, Shengqiong Wu, Kai Wang, Shijian Deng, Yapeng Tian, Junbin Xiao, Yazhou Xing, Yinghao Ma, Bobo Li, Roger Zimmermann, Lei Cui, Furu Wei, Jiebo Luo, Hao Fei
cs.AI

Abstract

Audio-Visual Intelligence (AVI) has emerged as a central frontier in artificial intelligence, bridging auditory and visual modalities to enable machines that perceive, generate, and interact with the multimodal real world. In the era of large foundation models, joint modeling of audio and vision has become increasingly crucial, not only for understanding but also for controllable generation and reasoning across dynamic, temporally grounded signals. Recent advances, such as Meta MovieGen and Google Veo-3, highlight the growing industrial and academic focus on unified audio-visual architectures that learn from massive multimodal data. However, despite rapid progress, the literature remains fragmented, spanning diverse tasks, inconsistent taxonomies, and heterogeneous evaluation practices that impede systematic comparison and knowledge integration. This survey provides the first comprehensive review of AVI through the lens of large foundation models. We establish a unified taxonomy covering the broad landscape of AVI tasks, ranging from understanding (e.g., speech recognition, sound localization) to generation (e.g., audio-driven video synthesis, video-to-audio) and interaction (e.g., dialogue, embodied, or agentic interfaces). We synthesize methodological foundations, including modality tokenization, cross-modal fusion, autoregressive and diffusion-based generation, large-scale pretraining, instruction alignment, and preference optimization. Furthermore, we curate representative datasets, benchmarks, and evaluation metrics, offering a structured comparison across task families and identifying open challenges in synchronization, spatial reasoning, controllability, and safety. By consolidating this rapidly expanding field into a coherent framework, this survey aims to serve as a foundational reference for future research on large-scale AVI.
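
The abstract lists cross-modal fusion among the methodological foundations the survey covers. As a rough illustration only, the sketch below shows one common fusion pattern (visual tokens cross-attending to audio tokens); the class name `AudioVisualFusion`, the token dimensions, and the single-block design are illustrative assumptions, not the architecture of any model discussed in the paper.

```python
# Hypothetical sketch of cross-modal fusion via cross-attention.
import torch
import torch.nn as nn


class AudioVisualFusion(nn.Module):
    """Fuses audio tokens into visual tokens with one cross-attention block."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, visual_tokens: torch.Tensor, audio_tokens: torch.Tensor) -> torch.Tensor:
        # Visual tokens are the queries; audio tokens provide keys and values.
        attended, _ = self.cross_attn(visual_tokens, audio_tokens, audio_tokens)
        fused = self.norm(visual_tokens + attended)
        return fused + self.ffn(fused)


if __name__ == "__main__":
    fusion = AudioVisualFusion()
    video = torch.randn(2, 196, 512)   # e.g., patch tokens from a video frame encoder
    audio = torch.randn(2, 64, 512)    # e.g., tokens from an audio spectrogram encoder
    print(fusion(video, audio).shape)  # torch.Size([2, 196, 512])
```

Real systems typically stack many such blocks, interleave them with self-attention, and may fuse in both directions (audio attending to vision as well); this sketch only conveys the basic mechanism.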