Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities

Abstract

Augmenting large language models (LLMs) to understand audio, including non-speech sounds and non-verbal speech, is critically important for diverse real-world applications of LLMs. In this paper, we propose Audio Flamingo, a novel audio language model with 1) strong audio understanding abilities, 2) the ability to quickly adapt to unseen tasks via in-context learning and retrieval, and 3) strong multi-turn dialogue abilities. We introduce a series of training techniques, architecture designs, and data strategies to equip our model with these abilities. Extensive evaluations across various audio understanding tasks confirm the efficacy of our method, setting new state-of-the-art results.
