AnimalCLAP: Taxonomy-Aware Language-Audio Pretraining for Species Recognition and Trait Inference
March 23, 2026
Authors: Risa Shinoda, Kaede Shiohara, Nakamasa Inoue, Hiroaki Santo, Fumio Okura
cs.AI
Abstract
Animal vocalizations provide crucial insights for wildlife assessment, particularly in complex environments such as forests, aiding species identification and ecological monitoring. Recent advances in deep learning have enabled automatic classification of species from their vocalizations. However, classifying species unseen during training remains challenging. To address this limitation, we introduce AnimalCLAP, a taxonomy-aware language-audio framework comprising a new dataset and model that incorporate hierarchical biological information. Specifically, our vocalization dataset consists of 4,225 hours of recordings covering 6,823 species, annotated with 22 ecological traits. The AnimalCLAP model is trained on this dataset to align audio and textual representations using taxonomic structures, improving the recognition of unseen species. We demonstrate that our proposed model effectively infers ecological and biological attributes of species directly from their vocalizations, achieving superior performance compared to CLAP. Our dataset, code, and models will be publicly available at https://dahlian00.github.io/AnimalCLAP_Page/.
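The abstract describes CLAP-style training: audio and text embeddings of matching pairs are pulled together with a contrastive objective, and the text side encodes the taxonomic hierarchy so that an unseen species can still be matched through its genus, family, and order. The sketch below is a minimal illustration of that idea, not the authors' implementation: the prompt template, function names, and the symmetric InfoNCE loss are our assumptions about how such a framework is typically built.

```python
import numpy as np

def build_taxonomy_prompt(species, genus, family, order):
    # Hypothetical prompt template: spells out the taxonomic hierarchy in text,
    # so an unseen species can still align via shared higher-level ranks.
    return (f"A vocalization of {species}, a species in genus {genus}, "
            f"family {family}, order {order}.")

def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired audio/text embeddings.

    Row i of audio_emb and row i of text_emb are assumed to be a matching pair.
    """
    # L2-normalize so dot products are cosine similarities.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature          # (B, B) similarity matrix
    diag = np.arange(len(a))                # matching pairs sit on the diagonal

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_prob[diag, diag].mean()

    # Average the audio->text and text->audio directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

At inference, zero-shot species recognition would score an audio clip against prompts built for every candidate species; the taxonomy terms in the prompt are what lets embeddings of related species stay close even when a species never appeared in training.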