AnimalCLAP: Taxonomy-Aware Language-Audio Pretraining for Species Recognition and Trait Inference

Abstract

Animal vocalizations provide crucial insights for wildlife assessment, particularly in complex environments such as forests, aiding species identification and ecological monitoring. Recent advances in deep learning have enabled automatic species classification from vocalizations, but classifying species unseen during training remains challenging. To address this limitation, we introduce AnimalCLAP, a taxonomy-aware language-audio framework comprising a new dataset and a model that incorporate hierarchical biological information. Specifically, our vocalization dataset consists of 4,225 hours of recordings covering 6,823 species, annotated with 22 ecological traits. The AnimalCLAP model is trained on this dataset to align audio and textual representations using taxonomic structures, improving the recognition of unseen species. We demonstrate that the proposed model outperforms CLAP and effectively infers ecological and biological attributes of species directly from their vocalizations. Our dataset, code, and models will be publicly available at https://dahlian00.github.io/AnimalCLAP_Page/.
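To make the idea of aligning audio and text representations with taxonomic structure more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical prompt template that spells out the taxonomic path of a species, and it uses the generic symmetric contrastive loss common to CLIP/CLAP-style models. The function names, the template wording, and the temperature value are all illustrative assumptions; the abstract does not specify them.

```python
import math


def taxonomy_prompt(order, family, genus, species):
    """Hypothetical text template (an assumption, not from the paper)
    that exposes the taxonomic hierarchy to the text encoder."""
    return f"a sound of {species}, genus {genus}, family {family}, order {order}"


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Generic CLIP/CLAP-style loss: the i-th audio clip and the i-th text
    prompt are a positive pair; every other pairing in the batch is a negative.
    The temperature of 0.07 is an assumed, commonly used default."""
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    n = len(a)
    # Pairwise cosine similarities scaled by the temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Audio-to-text direction: each row should peak on the diagonal.
    a2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-audio direction: each column should peak on the diagonal.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2a = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (a2t + t2a)


# Toy usage with hand-made 2-D embeddings standing in for encoder outputs.
prompt = taxonomy_prompt("Passeriformes", "Paridae", "Parus", "Parus major")
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                      [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss is close to zero when each audio embedding matches its own prompt and large when the pairs are swapped. One plausible reading of the abstract is that prompts sharing genus, family, or order give an unseen species a text embedding near those of its relatives, though how AnimalCLAP actually encodes the hierarchy is not described in this section.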

