Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities

Abstract

Augmenting large language models (LLMs) to understand audio (including non-speech sounds and non-verbal speech) is critically important for diverse real-world applications of LLMs. In this paper, we propose Audio Flamingo, a novel audio language model with 1) strong audio understanding abilities, 2) the ability to quickly adapt to unseen tasks via in-context learning and retrieval, and 3) strong multi-turn dialogue abilities. We introduce a series of training techniques, architecture designs, and data strategies to give our model these abilities. Extensive evaluations across various audio understanding tasks confirm the efficacy of our method and set new state-of-the-art results.
