SALMONN: Towards Generic Hearing Abilities for Large Language Models

Abstract

Hearing is arguably an essential ability of artificial intelligence (AI) agents in the physical world. It refers to the perception and understanding of general auditory information, which consists of at least three types of sound: speech, audio events, and music. In this paper, we propose SALMONN, a speech audio language music open neural network, built by integrating a pre-trained text-based large language model (LLM) with speech and audio encoders into a single multimodal model. SALMONN enables the LLM to directly process and understand general audio inputs and to achieve competitive performance on a number of speech and audio tasks used in training, such as automatic speech recognition and translation, auditory-information-based question answering, emotion recognition, speaker verification, and music and audio captioning. SALMONN also has a diverse set of emergent abilities unseen in training, including but not limited to speech translation to untrained languages, speech-based slot filling, spoken-query-based question answering, audio-based storytelling, and speech audio co-reasoning. We study the presence of these cross-modal emergent abilities and propose a novel few-shot activation tuning approach to activate them in SALMONN. To our knowledge, SALMONN is the first model of its type and can be regarded as a step towards AI with generic hearing abilities. An interactive demo of SALMONN is available at https://github.com/bytedance/SALMONN, and the training code and model checkpoints will be released upon acceptance.
