Audio-FLAN: A Preliminary Release

Abstract

Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, which hinders the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning across text and vision, its application to audio remains largely unexplored. A major obstacle is the lack of comprehensive datasets that unify audio understanding and generation. To address this, we introduce Audio-FLAN, a large-scale instruction-tuning dataset covering 80 diverse tasks across the speech, music, and sound domains, with over 100 million instances. Audio-FLAN lays the foundation for unified audio-language models that can seamlessly handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) tasks across a wide range of audio domains in a zero-shot manner. The Audio-FLAN dataset is available on HuggingFace and GitHub and will be continuously updated.