SALMONN: Towards Generic Hearing Abilities for Large Language Models

Abstract

Hearing is arguably an essential ability of artificial intelligence (AI) agents in the physical world. It refers to the perception and understanding of general auditory information consisting of at least three types of sound: speech, audio events, and music. In this paper, we propose SALMONN, a speech audio language music open neural network, built by integrating a pre-trained text-based large language model (LLM) with speech and audio encoders into a single multimodal model. SALMONN enables the LLM to directly process and understand general audio inputs and to achieve competitive performance on a number of speech and audio tasks used in training, such as automatic speech recognition and translation, auditory-information-based question answering, emotion recognition, speaker verification, and music and audio captioning. SALMONN also has a diverse set of emergent abilities unseen in training, including but not limited to speech translation to untrained languages, speech-based slot filling, spoken-query-based question answering, audio-based storytelling, and speech audio co-reasoning. We study the presence of these cross-modal emergent abilities and propose a novel few-shot activation tuning approach to activate them in SALMONN. To our knowledge, SALMONN is the first model of its type and can be regarded as a step towards AI with generic hearing abilities. An interactive demo of SALMONN is available at \url{https://github.com/bytedance/SALMONN}, and the training code and model checkpoints will be released upon acceptance.
