Qwen2-Audio Technical Report

Abstract

We introduce Qwen2-Audio, the latest progress of Qwen-Audio: a large-scale audio-language model that accepts various audio signal inputs and either performs audio analysis or responds directly in text to speech instructions. Instead of complex hierarchical tags, we use natural language prompts for different data and tasks, which simplifies the pre-training process, and we have further expanded the data volume. We have boosted the instruction-following capability of Qwen2-Audio and implemented two distinct audio interaction modes: voice chat and audio analysis. In voice chat mode, users can freely engage in voice interactions with Qwen2-Audio without text input. In audio analysis mode, users can provide audio and text instructions for analysis during the interaction. Note that we do not use any system prompts to switch between voice chat and audio analysis modes. Qwen2-Audio is capable of intelligently comprehending the content within audio and following voice commands to respond appropriately. For instance, given an audio segment that simultaneously contains sounds, a multi-speaker conversation, and a voice command, Qwen2-Audio can directly understand the command and provide an interpretation of and response to the audio. Additionally, DPO (Direct Preference Optimization) has been used to optimize the model's factuality and adherence to desired behavior. According to the evaluation results from AIR-Bench, Qwen2-Audio outperforms previous SOTA (state-of-the-art) models, such as Gemini-1.5-pro, in tests focused on audio-centric instruction-following capabilities. Qwen2-Audio is open-sourced with the aim of fostering the advancement of the multi-modal language community.
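
The abstract names DPO but does not spell out the objective. As a minimal sketch, assuming the standard DPO formulation (the negative log-sigmoid of a scaled difference of policy-to-reference log-ratios), the loss for a single preference pair could be computed as follows. The function name, argument names, and the default `beta` are hypothetical and chosen only for illustration; they are not taken from the report.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (assumed formulation).

    Each argument is the summed log-probability of a full response
    (preferred = "chosen", dispreferred = "rejected") under either the
    policy being trained or the frozen reference model.
    beta is a hypothetical default, not a value from the report.
    """
    # Log-ratio of policy to reference for each response.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and reference agree on both responses, the margin is zero and the loss equals log 2. The loss falls below that value as the policy raises the preferred response relative to the reference, which is the sense in which DPO steers the model toward the desired behavior.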
