Qwen2-Audio Technical Report

Abstract

We introduce the latest progress of Qwen-Audio: Qwen2-Audio, a large-scale audio-language model that accepts a variety of audio signal inputs and either analyzes the audio or responds directly in text, following speech instructions. Instead of complex hierarchical tags, we simplify pre-training by using natural language prompts for the different data and tasks, and we further expand the data volume. We strengthen the instruction-following capability of Qwen2-Audio and implement two distinct audio interaction modes: voice chat and audio analysis. In voice chat mode, users can freely hold voice interactions with Qwen2-Audio without any text input. In audio analysis mode, users can provide audio together with text instructions for analysis during the interaction. Note that we do not use any system prompts to switch between the voice chat and audio analysis modes. Qwen2-Audio can intelligently comprehend the content of the audio and follow voice commands to respond appropriately. For instance, given an audio segment that simultaneously contains sounds, a multi-speaker conversation, and a voice command, Qwen2-Audio can directly understand the command and provide an interpretation of and response to the audio. Additionally, we apply DPO to improve the model's factuality and its adherence to desired behavior. According to the evaluation results on AIR-Bench, Qwen2-Audio outperforms previous state-of-the-art models, such as Gemini-1.5-pro, on tests focused on audio-centric instruction following. Qwen2-Audio is open-sourced with the aim of fostering the advancement of the multi-modal language community.
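
The abstract names DPO but does not state the objective. As a rough illustration only, the sketch below computes the standard DPO loss for a single preference pair, as it is usually written in the general literature; it is not taken from this report. The function name `dpo_loss`, the argument names, and the default `beta=0.1` are assumptions made for the example, and the exact formulation and hyperparameters used for Qwen2-Audio may differ.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model. `beta` is an
    assumed value, not one reported in the abstract.
    """
    # Implicit rewards: how much more the policy likes each response
    # than the reference model does.
    chosen_reward = policy_logp_chosen - ref_logp_chosen
    rejected_reward = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_reward - rejected_reward)

    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # evaluated in a numerically stable way for either sign of margin.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference model agree exactly, the margin is zero and the loss is log 2 (about 0.693). The loss falls below that value as the policy shifts probability toward the preferred response relative to the reference, and rises above it when the policy favors the rejected one.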
