Pengi: An Audio Language Model for Audio Tasks
May 19, 2023
Authors: Soham Deshmukh, Benjamin Elizalde, Rita Singh, Huaming Wang
cs.AI
Abstract
In the domain of audio processing, Transfer Learning has facilitated the rise
of Self-Supervised Learning and Zero-Shot Learning techniques. These approaches
have led to the development of versatile models capable of tackling a wide
array of tasks, while delivering state-of-the-art performance. However, current
models inherently lack the capacity to produce the requisite language for
open-ended tasks, such as Audio Captioning or Audio Question & Answering. We
introduce Pengi, a novel Audio Language Model that leverages Transfer Learning
by framing all audio tasks as text-generation tasks. It takes an audio
recording and text as input, and generates free-form text as output. The input
audio is represented as a sequence of continuous embeddings by an audio
encoder. A text encoder does the same for the corresponding text input. Both
sequences are combined as a prefix to prompt a pre-trained frozen language
model. The unified architecture of Pengi enables open-ended tasks and
close-ended tasks without any additional fine-tuning or task-specific
extensions. When evaluated on 22 downstream tasks, our approach yields
state-of-the-art performance in several of them. Our results show that
connecting language models with audio models is a major step towards
general-purpose audio understanding.
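
The abstract describes a prefix-conditioning scheme: an audio encoder and a text encoder each produce a sequence of continuous embeddings, the two sequences are concatenated into a prefix, and a frozen pre-trained language model generates free-form text conditioned on that prefix. The following is a minimal sketch of that idea, not the authors' implementation; all module names, dimensions, and the toy transformer used as a stand-in for the frozen language model are illustrative assumptions.

```python
# Minimal sketch of audio+text prefix conditioning of a frozen language model.
# All names and sizes are illustrative assumptions, not Pengi's actual code.
import torch
import torch.nn as nn


class PrefixMapper(nn.Module):
    """Maps one pooled encoder embedding to a fixed-length sequence of LM-sized vectors."""

    def __init__(self, enc_dim: int, lm_dim: int, prefix_len: int):
        super().__init__()
        self.prefix_len, self.lm_dim = prefix_len, lm_dim
        self.net = nn.Sequential(
            nn.Linear(enc_dim, lm_dim * prefix_len),
            nn.GELU(),
            nn.Linear(lm_dim * prefix_len, lm_dim * prefix_len),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, enc_dim) -> (batch, prefix_len, lm_dim)
        return self.net(x).view(x.size(0), self.prefix_len, self.lm_dim)


class ToyAudioLanguageModel(nn.Module):
    """Audio prefix + text prefix prepended to the input of a frozen language model."""

    def __init__(self, audio_dim=512, text_dim=512, lm_dim=256, prefix_len=4, vocab=1000):
        super().__init__()
        self.audio_map = PrefixMapper(audio_dim, lm_dim, prefix_len)
        self.text_map = PrefixMapper(text_dim, lm_dim, prefix_len)
        self.tok_emb = nn.Embedding(vocab, lm_dim)
        layer = nn.TransformerEncoderLayer(lm_dim, nhead=4, batch_first=True)
        self.lm = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for a pre-trained causal LM
        self.lm_head = nn.Linear(lm_dim, vocab)
        # The language model stays frozen; only the prefix mappers would be trained.
        for p in list(self.lm.parameters()) + list(self.tok_emb.parameters()):
            p.requires_grad = False

    def forward(self, audio_emb, text_emb, token_ids):
        # Build the prompt: [audio prefix | text prefix | token embeddings].
        prefix = torch.cat([self.audio_map(audio_emb), self.text_map(text_emb)], dim=1)
        seq = torch.cat([prefix, self.tok_emb(token_ids)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        hidden = self.lm(seq, mask=mask)            # causal mask keeps decoding left-to-right
        return self.lm_head(hidden[:, prefix.size(1):])  # logits only over the text positions


# Usage with random stand-in features (real inputs would come from pre-trained encoders).
model = ToyAudioLanguageModel()
logits = model(torch.randn(2, 512), torch.randn(2, 512), torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 1000])
```

Because the conditioning is expressed entirely through the prefix, the same forward pass serves open-ended tasks (captioning, question answering) and close-ended tasks (classification phrased as text), which is the unification the abstract claims.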