Qwen2-Audio Technical Report
July 15, 2024
Authors: Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, Jingren Zhou
cs.AI
Abstract
We introduce the latest progress of Qwen-Audio, a large-scale audio-language
model called Qwen2-Audio, which is capable of accepting various audio signal
inputs and performing audio analysis or direct textual responses with regard to
speech instructions. In contrast to complex hierarchical tags, we have
simplified the pre-training process by utilizing natural language prompts for
different data and tasks, and have further expanded the data volume. We have
boosted the instruction-following capability of Qwen2-Audio and implemented two
distinct audio interaction modes for voice chat and audio analysis. In the
voice chat mode, users can freely engage in voice interactions with Qwen2-Audio
without text input. In the audio analysis mode, users can provide audio and
text instructions for analysis during the interaction. Note that we do not use
any system prompts to switch between voice chat and audio analysis modes.
Qwen2-Audio is capable of intelligently comprehending the content within audio
and following voice commands to respond appropriately. For instance, in an
audio segment that simultaneously contains sounds, multi-speaker conversations,
and a voice command, Qwen2-Audio can directly understand the command and
provide an interpretation and response to the audio. Additionally, DPO has
optimized the model's performance in terms of factuality and adherence to
desired behavior. According to the evaluation results from AIR-Bench,
Qwen2-Audio outperformed previous SOTAs, such as Gemini-1.5-pro, in tests
focused on audio-centric instruction-following capabilities. Qwen2-Audio is
open-sourced with the aim of fostering the advancement of the multi-modal
language community.