AnimalCLAP: Taxonomy-Aware Language-Audio Pretraining for Species Recognition and Trait Inference
March 23, 2026
Authors: Risa Shinoda, Kaede Shiohara, Nakamasa Inoue, Hiroaki Santo, Fumio Okura
cs.AI
Abstract
Animal vocalizations provide crucial insights for wildlife assessment, particularly in complex environments such as forests, aiding species identification and ecological monitoring. Recent advances in deep learning have enabled automatic species classification from their vocalizations. However, classifying species unseen during training remains challenging. To address this limitation, we introduce AnimalCLAP, a taxonomy-aware language-audio framework comprising a new dataset and model that incorporate hierarchical biological information. Specifically, our vocalization dataset consists of 4,225 hours of recordings covering 6,823 species, annotated with 22 ecological traits. The AnimalCLAP model is trained on this dataset to align audio and textual representations using taxonomic structures, improving the recognition of unseen species. We demonstrate that our proposed model effectively infers ecological and biological attributes of species directly from their vocalizations, achieving superior performance compared to CLAP. Our dataset, code, and models will be publicly available at https://dahlian00.github.io/AnimalCLAP_Page/.
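The abstract describes aligning audio and text representations in a shared embedding space, as in CLAP, so that unseen species can be recognized zero-shot by scoring an audio clip against taxonomy-derived text prompts. A minimal illustrative sketch of that inference step, assuming precomputed embeddings (the prompt wording, dimensions, and temperature below are hypothetical, not the authors' implementation):

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_scores(audio_emb, text_embs, temperature=0.07):
    """CLAP-style zero-shot scoring: cosine similarity between one audio
    embedding and per-class text embeddings, softmax-normalized."""
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_embs)
    logits = t @ a / temperature
    exp = np.exp(logits - logits.max())  # stable softmax
    return exp / exp.sum()

# Toy example with 3 hypothetical class prompts built from taxonomy text,
# e.g. "a vocalization of <species>, family <family>, order <order>".
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(3, 8))
audio_emb = text_embs[1] + 0.1 * rng.normal(size=8)  # clip closest to class 1
probs = zero_shot_scores(audio_emb, text_embs)
print(int(np.argmax(probs)))  # → 1
```

Embedding taxonomic ranks in the text prompt is what lets a model generalize: an unseen species can still share a family or order with training species, pulling its prompt toward acoustically similar neighbors in the shared space.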