Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities

Abstract

Augmenting large language models (LLMs) to understand audio, including non-speech sounds and non-verbal speech, is critically important for diverse real-world applications of LLMs. In this paper, we propose Audio Flamingo, a novel audio language model with 1) strong audio understanding abilities, 2) the ability to quickly adapt to unseen tasks via in-context learning and retrieval, and 3) strong multi-turn dialogue abilities. We introduce a series of training techniques, architecture designs, and data strategies to equip our model with these abilities. Extensive evaluations across various audio understanding tasks confirm the efficacy of our method, setting new state-of-the-art results.
