ReCLAP: Improving Zero Shot Audio Classification by Describing Sounds

Abstract

Open-vocabulary audio-language models such as CLAP offer a promising approach to zero-shot audio classification (ZSAC) because they can classify audio into any set of categories specified with natural language prompts. In this paper, we propose a simple but effective method to improve ZSAC with CLAP. Specifically, we shift from the conventional method of using prompts with abstract category labels (e.g., "Sound of an organ") to prompts that describe sounds using their inherent descriptive features in diverse contexts (e.g., "The organ's deep and resonant tones filled the cathedral."). To achieve this, we first propose ReCLAP, a CLAP model trained with rewritten audio captions for improved understanding of sounds in the wild. These rewritten captions describe each sound event in the original caption using its unique discriminative characteristics. ReCLAP outperforms all baselines on both multi-modal audio-text retrieval and ZSAC. Next, to improve zero-shot audio classification with ReCLAP, we propose prompt augmentation. In contrast to the traditional method of employing hand-written template prompts, we generate custom prompts for each unique label in the dataset. These custom prompts first describe the sound event in the label and then place it in diverse scenes. Our proposed method improves ReCLAP's performance on ZSAC by 1%-18% and outperforms all baselines by 1%-55%.
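The zero-shot classification step with several descriptive prompts per label can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the toy 3-dimensional vectors standing in for encoder outputs, and the choice to average cosine similarity over each label's prompts are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(audio_emb, prompt_embs_by_label):
    """Return (best_label, scores) for one audio embedding.

    prompt_embs_by_label maps each label to a list of text embeddings,
    one per custom prompt written for that label. Averaging the
    similarities over prompts is an assumption of this sketch.
    """
    scores = {}
    for label, prompt_embs in prompt_embs_by_label.items():
        sims = [cosine(audio_emb, p) for p in prompt_embs]
        scores[label] = sum(sims) / len(sims)
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy vectors standing in for text/audio encoder outputs (hypothetical,
# not real CLAP or ReCLAP embeddings).
toy_prompts = {
    "organ": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "dog bark": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
}
toy_audio = [0.8, 0.3, 0.05]

predicted, label_scores = zero_shot_classify(toy_audio, toy_prompts)
print(predicted, label_scores)
```

In a real setup the embeddings would come from the model's audio and text encoders, and each label's prompt list would hold the generated descriptive sentences rather than a single template such as "Sound of an organ".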
