Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Abstract

Recently, instruction-following audio-language models have received broad attention for audio interaction with humans. However, the absence of pre-trained audio models capable of handling diverse audio types and tasks has hindered progress in this field. Consequently, most existing work supports only a limited range of interaction capabilities. In this paper, we develop the Qwen-Audio model and address this limitation by scaling up audio-language pre-training to cover over 30 tasks and various audio types, such as human speech, natural sounds, music, and songs, to facilitate universal audio understanding. However, directly co-training on all tasks and datasets can lead to interference, because the textual labels associated with different datasets vary considerably in task focus, language, annotation granularity, and text structure. To overcome this one-to-many interference, we carefully design a multi-task training framework that conditions the decoder on a sequence of hierarchical tags, encouraging knowledge sharing through shared tags and avoiding interference through specified tags. Remarkably, Qwen-Audio achieves impressive performance across diverse benchmark tasks without any task-specific fine-tuning, surpassing its counterparts. Building on the capabilities of Qwen-Audio, we further develop Qwen-Audio-Chat, which accepts a variety of audio and text inputs, enabling multi-turn dialogue and supporting a range of audio-centric scenarios.
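
The idea of conditioning the decoder on a sequence of hierarchical tags can be illustrated with a small sketch. This is not the authors' implementation: the tag strings, the `TaskSpec` fields, and the ordering of levels below are all hypothetical, chosen only to show how tasks that agree on the coarse levels share a common prefix (encouraging knowledge sharing) while diverging at the level that distinguishes them (avoiding interference).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical description of one training task (field names are assumptions)."""
    family: str          # coarse task family, e.g. "transcription" or "analysis"
    audio_language: str  # language spoken in the audio, or "unknown"
    task: str            # fine-grained task, e.g. "transcribe", "translate", "caption"
    text_language: str   # language of the target text
    timestamps: bool     # whether the target text carries timestamps


def build_tag_prefix(spec: TaskSpec) -> list[str]:
    """Order tags from coarse to fine so that related tasks share a prefix."""
    return [
        f"<|{spec.family}|>",
        f"<|audio:{spec.audio_language}|>",
        f"<|task:{spec.task}|>",
        f"<|text:{spec.text_language}|>",
        "<|timestamps|>" if spec.timestamps else "<|notimestamps|>",
    ]


def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Count how many leading tags two prefixes have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Three illustrative tasks (all values are made up for the example).
asr = TaskSpec("transcription", "en", "transcribe", "en", False)
s2t = TaskSpec("transcription", "en", "translate", "zh", False)
cap = TaskSpec("analysis", "unknown", "caption", "en", False)

if __name__ == "__main__":
    for spec in (asr, s2t, cap):
        print(" ".join(build_tag_prefix(spec)))
```

In this sketch, speech recognition and speech translation share their first two tags and split at the task tag, whereas captioning diverges from both at the first tag. During training, such a prefix would be fed to the decoder ahead of the target text, so the model learns which output format is expected for a given audio input.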
