Audio Dialogues: Dialogues dataset for audio and music understanding

Abstract

Existing datasets for audio understanding primarily focus on single-turn interactions (e.g., audio captioning and audio question answering) that describe audio in natural language, which limits audio understanding through interactive dialogue. To address this gap, we introduce Audio Dialogues, a multi-turn dialogue dataset containing 163.8k samples covering general sounds and music. In addition to dialogues, Audio Dialogues includes question-answer pairs for understanding and comparing multiple audio inputs together. Audio Dialogues uses a prompting-based approach and caption annotations from existing datasets to generate multi-turn dialogues with a Large Language Model (LLM). We evaluate existing audio-augmented large language models on the proposed dataset to demonstrate the complexity and applicability of Audio Dialogues. Our code for generating the dataset will be made publicly available. Detailed prompts and generated dialogues can be found on the demo website https://audiodialogues.github.io/.
