Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities
February 2, 2024
Authors: Zhifeng Kong, Arushi Goel, Rohan Badlani, Wei Ping, Rafael Valle, Bryan Catanzaro
cs.AI
Abstract
Augmenting large language models (LLMs) to understand audio -- including
non-speech sounds and non-verbal speech -- is critically important for diverse
real-world applications of LLMs. In this paper, we propose Audio Flamingo, a
novel audio language model with 1) strong audio understanding abilities, 2) the
ability to quickly adapt to unseen tasks via in-context learning and retrieval,
and 3) strong multi-turn dialogue abilities. We introduce a series of training
techniques, architecture design, and data strategies to enhance our model with
these abilities. Extensive evaluations across various audio understanding tasks
confirm the efficacy of our method, setting new state-of-the-art benchmarks.