Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
November 14, 2023
作者: Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, Jingren Zhou
cs.AI
Abstract
Recently, instruction-following audio-language models have received broad
attention for audio interaction with humans. However, the absence of
pre-trained audio models capable of handling diverse audio types and tasks has
hindered progress in this field. Consequently, most existing works have only
been able to support a limited range of interaction capabilities. In this
paper, we develop the Qwen-Audio model and address this limitation by scaling
up audio-language pre-training to cover over 30 tasks and various audio types,
such as human speech, natural sounds, music, and songs, to facilitate universal
audio understanding abilities. However, directly co-training all tasks and
datasets can lead to interference issues, as the textual labels associated with
different datasets exhibit considerable variations due to differences in task
focus, language, granularity of annotation, and text structure. To overcome this
one-to-many interference, we carefully design a multi-task training framework
that conditions the decoder on a sequence of hierarchical tags: shared tags
encourage knowledge sharing across tasks, while task-specific tags keep
differing datasets from interfering with one another. Remarkably, Qwen-Audio
achieves impressive
performance across diverse benchmark tasks without requiring any task-specific
fine-tuning, surpassing its counterparts. Building upon the capabilities of
Qwen-Audio, we further develop Qwen-Audio-Chat, which accepts diverse audio
and text inputs, enabling multi-turn dialogues and supporting a variety of
audio-centric scenarios.
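The hierarchical-tag conditioning described in the abstract can be sketched as follows. This is only an illustrative mock-up of the idea, not the paper's actual token vocabulary or implementation: the `build_prompt` helper and every tag name below are hypothetical, chosen to show how coarse shared tags and fine-grained task-specific tags could be composed into a decoder prompt.

```python
# Hypothetical sketch of hierarchical-tag conditioning for multi-task
# audio-language training. Tag names are illustrative only.

def build_prompt(audio_type: str, task: str, language: str, timestamps: bool) -> str:
    """Assemble a hierarchical tag sequence to prefix the decoder input.

    Coarse tags (transcription vs. analysis, language) are shared across
    many datasets, encouraging knowledge transfer; finer tags (task name,
    audio type, timestamp mode) separate datasets whose text labels differ,
    mitigating one-to-many interference.
    """
    tags = []
    # Top level: transcription-style vs. analysis-style output.
    tags.append("<|startoftranscription|>" if task == "transcribe"
                else "<|startofanalysis|>")
    tags.append(f"<|{language}|>")    # shared language tag
    tags.append(f"<|{task}|>")        # task-specific tag
    tags.append(f"<|{audio_type}|>")  # audio-type tag (speech/sound/music)
    tags.append("<|timestamps|>" if timestamps else "<|notimestamps|>")
    return "".join(tags)

# Two tasks share the <|en|> language tag but differ in task-level tags:
asr_prompt = build_prompt("speech", "transcribe", "en", timestamps=True)
caption_prompt = build_prompt("sound", "caption", "en", timestamps=False)
```

Under this scheme, examples from different datasets that share coarse tags (such as the language) train overlapping prompt prefixes, while their distinct task and format tags give the decoder an unambiguous signal about which label convention to emit.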