Natural Language Supervision for General-Purpose Audio Representations

Abstract

Audio-language models jointly learn multimodal text and audio representations that enable zero-shot inference. These models rely on their encoders to create powerful representations of the input and to generalize to multiple tasks spanning sound, music, and speech. Although such models have achieved remarkable performance, a gap remains relative to task-specific models. In this paper, we propose a Contrastive Language-Audio Pretraining model that is pretrained on a diverse collection of 4.6M audio-text pairs and employs two innovative encoders for zero-shot inference. To learn audio representations, we trained an audio encoder on 22 audio tasks instead of the standard training on sound event classification. To learn language representations, we trained an autoregressive decoder-only model instead of the standard encoder-only models. The audio and language representations are then brought into a joint multimodal space using contrastive learning. Using these encoders improved downstream performance. We extensively evaluated the generalization of our representations on 26 downstream tasks, the largest such evaluation in the literature. Our model achieves state-of-the-art results on several tasks, leading the way towards general-purpose audio representations.
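
The abstract names two mechanisms without spelling them out: contrastive learning that aligns audio and text embeddings in a joint space, and zero-shot inference over that space. The following is a minimal sketch of how such a setup is commonly written, not the authors' implementation. It assumes a symmetric cross-entropy (InfoNCE-style) objective over cosine similarities, a fixed temperature of 0.07, and toy 2-dimensional embeddings standing in for encoder outputs. All function names and class labels are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(audio_embs, text_embs, temperature=0.07):
    """Cosine similarity of every audio/text pair, divided by a temperature.

    The temperature value is an assumption made for this sketch.
    """
    a = [l2_normalize(v) for v in audio_embs]
    t = [l2_normalize(v) for v in text_embs]
    return [[sum(x * y for x, y in zip(av, tv)) / temperature for tv in t]
            for av in a]

def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Average of audio-to-text and text-to-audio cross-entropy.

    Pair i in the batch is treated as the only positive for row i and column i.
    """
    logits = similarity_matrix(audio_embs, text_embs, temperature)
    n = len(logits)
    audio_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    columns = [list(col) for col in zip(*logits)]
    text_to_audio = -sum(log_softmax(columns[i])[i] for i in range(n)) / n
    return 0.5 * (audio_to_text + text_to_audio)

def zero_shot_classify(audio_emb, class_text_embs, class_names, temperature=0.07):
    """Pick the class whose text embedding is most similar to the audio embedding."""
    logits = similarity_matrix([audio_emb], class_text_embs, temperature)[0]
    probs = [math.exp(lp) for lp in log_softmax(logits)]
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs

if __name__ == "__main__":
    # Toy embeddings; in practice these would come from the audio and text encoders.
    audio = [[1.0, 0.0], [0.0, 1.0]]
    text = [[2.0, 0.1], [0.1, 3.0]]
    print("loss:", symmetric_contrastive_loss(audio, text))
    print(zero_shot_classify([0.9, 0.1], text, ["dog bark", "siren"]))
```

In this sketch no task-specific classifier is trained: zero-shot classification reduces to embedding one text prompt per class and comparing each against the audio embedding, which is why the quality of both encoders directly bounds downstream performance.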
