Pengi: An Audio Language Model for Audio Tasks
May 19, 2023
Authors: Soham Deshmukh, Benjamin Elizalde, Rita Singh, Huaming Wang
cs.AI
Abstract
In the domain of audio processing, Transfer Learning has facilitated the rise
of Self-Supervised Learning and Zero-Shot Learning techniques. These approaches
have led to the development of versatile models capable of tackling a wide
array of tasks, while delivering state-of-the-art performance. However, current
models inherently lack the capacity to produce the requisite language for
open-ended tasks, such as Audio Captioning or Audio Question & Answering. We
introduce Pengi, a novel Audio Language Model that leverages Transfer Learning
by framing all audio tasks as text-generation tasks. It takes an audio
recording and text as input, and generates free-form text as output. The input
audio is represented as a sequence of continuous embeddings by an audio
encoder. A text encoder does the same for the corresponding text input. Both
sequences are combined as a prefix to prompt a pre-trained frozen language
model. The unified architecture of Pengi enables open-ended tasks and
close-ended tasks without any additional fine-tuning or task-specific
extensions. When evaluated on 22 downstream tasks, our approach yields
state-of-the-art performance in several of them. Our results show that
connecting language models with audio models is a major step towards
general-purpose audio understanding.
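
The abstract describes a prefix-conditioning scheme: an audio encoder and a text encoder each produce a sequence of continuous embeddings, the two sequences are concatenated into a prefix, and a frozen pre-trained language model generates free-form text conditioned on that prefix. The following is a minimal sketch of that idea, not the authors' implementation; all module names, dimensions, and the toy transformer used as a stand-in for the frozen language model are illustrative assumptions.

```python
# Minimal sketch of audio+text prefix conditioning of a frozen language model.
# All names and sizes are illustrative assumptions, not Pengi's actual code.
import torch
import torch.nn as nn


class PrefixMapper(nn.Module):
    """Maps one pooled encoder embedding to a fixed-length sequence of LM-sized vectors."""

    def __init__(self, enc_dim: int, lm_dim: int, prefix_len: int):
        super().__init__()
        self.prefix_len, self.lm_dim = prefix_len, lm_dim
        self.net = nn.Sequential(
            nn.Linear(enc_dim, lm_dim * prefix_len),
            nn.GELU(),
            nn.Linear(lm_dim * prefix_len, lm_dim * prefix_len),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, enc_dim) -> (batch, prefix_len, lm_dim)
        return self.net(x).view(x.size(0), self.prefix_len, self.lm_dim)


class ToyAudioLanguageModel(nn.Module):
    """Audio prefix + text prefix prepended to the input of a frozen language model."""

    def __init__(self, audio_dim=512, text_dim=512, lm_dim=256, prefix_len=4, vocab=1000):
        super().__init__()
        self.audio_map = PrefixMapper(audio_dim, lm_dim, prefix_len)
        self.text_map = PrefixMapper(text_dim, lm_dim, prefix_len)
        self.tok_emb = nn.Embedding(vocab, lm_dim)
        layer = nn.TransformerEncoderLayer(lm_dim, nhead=4, batch_first=True)
        self.lm = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for a pre-trained causal LM
        self.lm_head = nn.Linear(lm_dim, vocab)
        # The language model stays frozen; only the prefix mappers would be trained.
        for p in list(self.lm.parameters()) + list(self.tok_emb.parameters()):
            p.requires_grad = False

    def forward(self, audio_emb, text_emb, token_ids):
        # Build the prompt: [audio prefix | text prefix | token embeddings].
        prefix = torch.cat([self.audio_map(audio_emb), self.text_map(text_emb)], dim=1)
        seq = torch.cat([prefix, self.tok_emb(token_ids)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        hidden = self.lm(seq, mask=mask)            # causal mask keeps decoding left-to-right
        return self.lm_head(hidden[:, prefix.size(1):])  # logits only over the text positions


# Usage with random stand-in features (real inputs would come from pre-trained encoders).
model = ToyAudioLanguageModel()
logits = model(torch.randn(2, 512), torch.randn(2, 512), torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 1000])
```

Because the conditioning is expressed entirely through the prefix, the same forward pass serves open-ended tasks (captioning, question answering) and close-ended tasks (classification phrased as text), which is the unification the abstract claims.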