AnimalCLAP: Taxonomy-Aware Language-Audio Pretraining for Species Recognition and Trait Inference

Abstract

Animal vocalizations provide crucial insights for wildlife assessment, particularly in complex environments such as forests, aiding species identification and ecological monitoring. Recent advances in deep learning have enabled automatic species classification from their vocalizations. However, classifying species unseen during training remains challenging. To address this limitation, we introduce AnimalCLAP, a taxonomy-aware language-audio framework comprising a new dataset and model that incorporate hierarchical biological information. Specifically, our vocalization dataset consists of 4,225 hours of recordings covering 6,823 species, annotated with 22 ecological traits. The AnimalCLAP model is trained on this dataset to align audio and textual representations using taxonomic structures, improving the recognition of unseen species. We demonstrate that our proposed model effectively infers ecological and biological attributes of species directly from their vocalizations, achieving superior performance compared to CLAP. Our dataset, code, and models will be publicly available at https://dahlian00.github.io/AnimalCLAP_Page/.
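
To make the alignment idea concrete, the following is a minimal sketch of a CLAP-style symmetric contrastive objective together with zero-shot label retrieval. It is an illustration under stated assumptions, not the AnimalCLAP implementation: the caption template in `taxonomy_caption`, the rank names, the temperature value, and all function names are hypothetical, and the embeddings are assumed to come from some audio encoder and text encoder that are not shown.

```python
import numpy as np

def taxonomy_caption(ranks):
    """Hypothetical prompt template (an assumption, not the paper's):
    list the available taxonomic ranks from class down to species."""
    order = ["class", "order", "family", "genus", "species"]
    return "a sound of " + ", ".join(f"{r} {ranks[r]}" for r in order if r in ranks)

def l2_normalize(x):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(z, axis):
    # Numerically stable log-softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over a batch where row i of audio_emb
    is paired with row i of text_emb (temperature is an assumed value)."""
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature
    idx = np.arange(len(a))
    loss_audio_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_audio = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_audio_to_text + loss_text_to_audio)

def zero_shot_predict(audio_emb, label_text_emb):
    """Assign each clip to the label whose caption embedding is most similar."""
    sims = l2_normalize(audio_emb) @ l2_normalize(label_text_emb).T
    return sims.argmax(axis=1)
```

Under this sketch, an unseen species can still be scored at inference time, because its caption shares higher-rank tokens (for example genus or family) with captions seen during training; whether this matches the authors' exact prompt design is not stated in the abstract.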
