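OpenBEATs: A Fully Open-Source General-Purpose Audio Encoder

Abstract

Masked token prediction has emerged as a powerful pre-training objective across language, vision, and speech, offering the potential to unify these diverse modalities through a single pre-training task. However, its application to general audio understanding remains underexplored, with BEATs being the only notable example. BEATs has seen limited modifications due to the absence of open-source pre-training code. Furthermore, BEATs was trained only on AudioSet, restricting its broader downstream applicability. To address these gaps, we present OpenBEATs, an open-source framework that extends BEATs via multi-domain audio pre-training. We conduct comprehensive evaluations across six types of tasks, twenty-five datasets, and three audio domains, including audio reasoning tasks such as audio question answering, entailment, and captioning. OpenBEATs achieves state-of-the-art performance on six bioacoustics datasets, two environmental sound datasets, and five reasoning datasets, performing better than models exceeding a billion parameters at one-fourth their parameter size. These results demonstrate the effectiveness of multi-domain datasets and the masked token prediction task for learning general-purpose audio representations. To promote further research and reproducibility, we release all pre-training and evaluation code, pretrained and fine-tuned checkpoints, and training logs at https://shikhar-s.github.io/OpenBEATs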

