SALMONN: Towards Generic Hearing Abilities for Large Language Models

Abstract

Hearing is arguably an essential ability of artificial intelligence (AI) agents in the physical world. It refers to the perception and understanding of general auditory information consisting of at least three types of sound: speech, audio events, and music. In this paper, we propose SALMONN, a speech audio language music open neural network, built by integrating a pre-trained text-based large language model (LLM) with speech and audio encoders into a single multimodal model. SALMONN enables the LLM to directly process and understand general audio inputs and achieves competitive performance on a number of speech and audio tasks used in training, such as automatic speech recognition and translation, auditory-information-based question answering, emotion recognition, speaker verification, and music and audio captioning. SALMONN also has a diverse set of emergent abilities unseen in training, including but not limited to speech translation to untrained languages, speech-based slot filling, spoken-query-based question answering, audio-based storytelling, and speech audio co-reasoning. The presence of these cross-modal emergent abilities is studied, and a novel few-shot activation tuning approach is proposed to activate them. To our knowledge, SALMONN is the first model of its type and can be regarded as a step towards AI with generic hearing abilities. An interactive demo of SALMONN is available at https://github.com/bytedance/SALMONN, and the training code and model checkpoints will be released upon acceptance.

