Audio Dialogues: Dialogues dataset for audio and music understanding

Abstract

Existing datasets for audio understanding primarily focus on single-turn interactions (e.g., audio captioning and audio question answering) that describe audio in natural language, which limits understanding audio through interactive dialogue. To address this gap, we introduce Audio Dialogues: a multi-turn dialogue dataset containing 163.8k samples for general audio sounds and music. In addition to dialogues, Audio Dialogues also has question-answer pairs for understanding and comparing multiple input audio clips together. Audio Dialogues leverages a prompting-based approach and caption annotations from existing datasets to generate multi-turn dialogues using a Large Language Model (LLM). We evaluate existing audio-augmented large language models on our proposed dataset to demonstrate the complexity and applicability of Audio Dialogues. Our code for generating the dataset will be made publicly available. Detailed prompts and generated dialogues can be found on the demo website https://audiodialogues.github.io/.
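
As a rough illustration of the prompting-based approach mentioned above, the sketch below assembles a text prompt from the caption annotations of one audio clip. It is a minimal sketch under stated assumptions, not the authors' prompt or pipeline: the function name `build_dialogue_prompt`, the `num_turns` parameter, and the prompt wording are all hypothetical, and no LLM is called here. The actual prompts are on the demo website.

```python
def build_dialogue_prompt(captions, num_turns=3):
    """Assemble a prompt asking an LLM for a multi-turn dialogue about one clip.

    Hypothetical helper: the wording below is illustrative only and is not
    taken from the Audio Dialogues prompts.
    """
    if not captions:
        raise ValueError("at least one caption is required")
    # List every caption annotation so the LLM sees all available descriptions.
    caption_lines = "\n".join(f"- {c.strip()}" for c in captions)
    return (
        "The following captions describe a single audio clip:\n"
        f"{caption_lines}\n\n"
        f"Write a {num_turns}-turn dialogue between a user and an assistant "
        "about this clip. Use only information supported by the captions."
    )


if __name__ == "__main__":
    example = build_dialogue_prompt(
        ["A dog barks while rain falls", "Thunder rumbles in the distance"],
        num_turns=2,
    )
    print(example)
```

The resulting string would then be sent to whichever LLM is used for generation; that step is left out because the abstract does not specify the model or API.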

