OpenBEATs: A Fully Open-Source General-Purpose Audio Encoder

Abstract

Masked token prediction has emerged as a powerful pre-training objective across language, vision, and speech, offering the potential to unify these diverse modalities through a single pre-training task. However, its application to general audio understanding remains underexplored, with BEATs being the only notable example. BEATs has seen limited modifications because its pre-training code was never open-sourced. Furthermore, BEATs was trained only on AudioSet, which restricts its broader downstream applicability. To address these gaps, we present OpenBEATs, an open-source framework that extends BEATs via multi-domain audio pre-training. We conduct comprehensive evaluations across six types of tasks, twenty-five datasets, and three audio domains, including audio reasoning tasks such as audio question answering, entailment, and captioning. OpenBEATs achieves state-of-the-art performance on six bioacoustics datasets, two environmental sound datasets, and five reasoning datasets, outperforming models with more than a billion parameters at one-fourth of their parameter count. These results demonstrate the effectiveness of multi-domain datasets and the masked token prediction task for learning general-purpose audio representations. To promote further research and reproducibility, we release all pre-training and evaluation code, pre-trained and fine-tuned checkpoints, and training logs at https://shikhar-s.github.io/OpenBEATs
