BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment

Abstract

Understanding animal species from multimodal data poses an emerging challenge at the intersection of computer vision and ecology. While recent biological models, such as BioCLIP, have demonstrated strong alignment between images and textual taxonomic information for species identification, the integration of the audio modality remains an open problem. We propose BioVITA, a novel visual-textual-acoustic alignment framework for biological applications. BioVITA involves (i) a training dataset, (ii) a representation model, and (iii) a retrieval benchmark. First, we construct a large-scale training dataset comprising 1.3 million audio clips and 2.3 million images, covering 14,133 species annotated with 34 ecological trait labels. Second, building upon BioCLIP2, we introduce a two-stage training framework to effectively align audio representations with visual and textual representations. Third, we develop a cross-modal retrieval benchmark that covers all possible directional retrieval across the three modalities (i.e., image-to-audio, audio-to-text, text-to-image, and their reverse directions), with three taxonomic levels: Family, Genus, and Species. Extensive experiments demonstrate that our model learns a unified representation space that captures species-level semantics beyond taxonomy, advancing multimodal biodiversity understanding. The project page is available at: https://dahlian00.github.io/BioVITA_Page/
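The retrieval benchmark described above can be illustrated with a minimal sketch. It enumerates the six directional retrieval tasks across the three modalities and scores top-1 retrieval at the Family, Genus, and Species levels. Everything here is an assumption made for illustration: the function names, the cosine-similarity scoring, the top-1 metric, the toy two-dimensional embeddings, and the placeholder taxonomy labels are hypothetical and are not taken from the BioVITA code or its evaluation protocol.

```python
from itertools import permutations
import math

# The three modalities named in the abstract.
MODALITIES = ("image", "audio", "text")


def retrieval_directions():
    """Return all ordered (query, gallery) modality pairs: 3 * 2 = 6 directions."""
    return list(permutations(MODALITIES, 2))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def top1_accuracy(queries, gallery, level):
    """Fraction of queries whose nearest gallery item shares the label at `level`.

    Each item is (embedding, taxonomy), where taxonomy is a dict with the keys
    'family', 'genus', and 'species'. This metric is an illustrative assumption.
    """
    hits = 0
    for q_emb, q_tax in queries:
        best = max(gallery, key=lambda item: cosine(q_emb, item[0]))
        hits += best[1][level] == q_tax[level]
    return hits / len(queries)


# Hypothetical placeholder labels, not real taxa.
sp_a = {"family": "F1", "genus": "G1", "species": "S1"}
sp_b = {"family": "F1", "genus": "G1", "species": "S2"}
sp_c = {"family": "F2", "genus": "G2", "species": "S3"}

# Toy embeddings standing in for audio queries and an image gallery.
audio_queries = [([1.0, 0.1], sp_a)]
image_gallery = [([0.9, 0.2], sp_b), ([0.0, 1.0], sp_c)]

for level in ("family", "genus", "species"):
    print(level, top1_accuracy(audio_queries, image_gallery, level))
```

In this toy audio-to-image case the nearest image belongs to a different species of the same genus, so the match counts as correct at the Family and Genus levels but not at the Species level. This is why scoring the same retrieval at three taxonomic levels gives a more graded picture than species-level accuracy alone.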
