SALMONN: Towards Generic Hearing Abilities for Large Language Models

Abstract

Hearing is arguably an essential ability for artificial intelligence (AI) agents in the physical world. It refers to the perception and understanding of general auditory information, which comprises at least three types of sound: speech, audio events, and music. In this paper, we propose SALMONN, a speech audio language music open neural network, built by integrating a pre-trained text-based large language model (LLM) with speech and audio encoders into a single multimodal model. SALMONN enables the LLM to directly process and understand general audio inputs and to achieve competitive performance on a number of speech and audio tasks used in training, such as automatic speech recognition and translation, auditory-information-based question answering, emotion recognition, speaker verification, and music and audio captioning. SALMONN also has a diverse set of emergent abilities unseen in training, including but not limited to speech translation to untrained languages, speech-based slot filling, spoken-query-based question answering, audio-based storytelling, and speech audio co-reasoning. We study the presence of these cross-modal emergent abilities and propose a novel few-shot activation tuning approach to activate them in SALMONN. To our knowledge, SALMONN is the first model of its type and can be regarded as a step towards AI with generic hearing abilities. An interactive demo of SALMONN is available at https://github.com/bytedance/SALMONN, and the training code and model checkpoints will be released upon acceptance.
