Audio-FLAN: A Preliminary Release

February 23, 2025
Authors: Liumeng Xue, Ziya Zhou, Jiahao Pan, Zixuan Li, Shuai Fan, Yinghao Ma, Sitong Cheng, Dongchao Yang, Haohan Guo, Yujia Xiao, Xinsheng Wang, Zixuan Shen, Chuanbo Zhu, Xinshen Zhang, Tianchi Liu, Ruibin Yuan, Zeyue Tian, Haohe Liu, Emmanouil Benetos, Ge Zhang, Yike Guo, Wei Xue
cs.AI

Abstract

Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning across text and vision, its application to audio remains largely unexplored. A major obstacle is the lack of comprehensive datasets that unify audio understanding and generation. To address this, we introduce Audio-FLAN, a large-scale instruction-tuning dataset covering 80 diverse tasks across speech, music, and sound domains, with over 100 million instances. Audio-FLAN lays the foundation for unified audio-language models that can seamlessly handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) tasks across a wide range of audio domains in a zero-shot manner. The Audio-FLAN dataset is available on HuggingFace and GitHub and will be continuously updated.
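Since the abstract notes that the dataset is hosted on HuggingFace, a minimal sketch of accessing it with the `datasets` library is shown below. The repository ID and record schema are assumptions for illustration only; consult the official HuggingFace and GitHub pages for the actual identifiers.

```python
# Minimal sketch: streaming Audio-FLAN from the Hugging Face Hub.
# NOTE: the repository ID below is a hypothetical placeholder, and the
# record layout is not confirmed by the paper; check the official
# HuggingFace/GitHub release for the real dataset ID and schema.
from datasets import load_dataset

# Streaming avoids materializing the 100M+ instances locally.
ds = load_dataset(
    "Audio-FLAN/Audio-FLAN-Dataset",  # hypothetical repo ID
    split="train",
    streaming=True,
)

# Each instance should pair a natural-language instruction with audio
# and a target, spanning understanding and generation tasks across
# the speech, music, and sound domains.
example = next(iter(ds))
print(example)
```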
