
OpenBEATs: A Fully Open-Source General-Purpose Audio Encoder

July 18, 2025
Authors: Shikhar Bharadwaj, Samuele Cornell, Kwanghee Choi, Satoru Fukayama, Hye-jin Shim, Soham Deshmukh, Shinji Watanabe
cs.AI

Abstract

Masked token prediction has emerged as a powerful pre-training objective across language, vision, and speech, offering the potential to unify these diverse modalities through a single pre-training task. However, its application to general audio understanding remains underexplored, with BEATs being the only notable example. BEATs has seen limited modifications due to the absence of open-source pre-training code. Furthermore, BEATs was trained only on AudioSet, restricting its broader downstream applicability. To address these gaps, we present OpenBEATs, an open-source framework that extends BEATs via multi-domain audio pre-training. We conduct comprehensive evaluations across six types of tasks, twenty-five datasets, and three audio domains, including audio reasoning tasks such as audio question answering, entailment, and captioning. OpenBEATs achieves state-of-the-art performance on six bioacoustics datasets, two environmental sound datasets, and five reasoning datasets, outperforming models with more than a billion parameters at one-fourth of their parameter size. These results demonstrate the effectiveness of multi-domain datasets and the masked token prediction task for learning general-purpose audio representations. To promote further research and reproducibility, we release all pre-training and evaluation code, pre-trained and fine-tuned checkpoints, and training logs at https://shikhar-s.github.io/OpenBEATs