BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment
March 25, 2026
Authors: Risa Shinoda, Kaede Shiohara, Nakamasa Inoue, Kuniaki Saito, Hiroaki Santo, Fumio Okura
cs.AI
Abstract
Understanding animal species from multimodal data is an emerging challenge at the intersection of computer vision and ecology. While recent biological models such as BioCLIP have demonstrated strong alignment between images and textual taxonomic information for species identification, integrating the audio modality remains an open problem. We propose BioVITA, a novel visual-textual-acoustic alignment framework for biological applications. BioVITA comprises (i) a training dataset, (ii) a representation model, and (iii) a retrieval benchmark. First, we construct a large-scale training dataset of 1.3 million audio clips and 2.3 million images covering 14,133 species, annotated with 34 ecological trait labels. Second, building upon BioCLIP2, we introduce a two-stage training framework to effectively align audio representations with visual and textual representations. Third, we develop a cross-modal retrieval benchmark that covers every retrieval direction across the three modalities (image-to-audio, audio-to-text, text-to-image, and their reverse directions) at three taxonomic levels: Family, Genus, and Species. Extensive experiments demonstrate that our model learns a unified representation space that captures species-level semantics beyond taxonomy, advancing multimodal biodiversity understanding. The project page is available at: https://dahlian00.github.io/BioVITA_Page/
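To make the benchmark setup concrete, the following is a minimal sketch of how the six retrieval directions can be enumerated and how a single direction might be scored. It is an illustration only: the abstract does not state the evaluation metric, so top-1 accuracy under cosine similarity is an assumption, and the function names (`retrieval_directions`, `top1_accuracy`) and toy embeddings are hypothetical rather than taken from BioVITA.

```python
from itertools import permutations
import math

MODALITIES = ("image", "text", "audio")


def retrieval_directions(modalities=MODALITIES):
    """Return every ordered (query, gallery) pair of distinct modalities."""
    return list(permutations(modalities, 2))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def top1_accuracy(queries, gallery, query_labels, gallery_labels):
    """Fraction of queries whose most similar gallery item shares their label.

    Assumed metric for illustration; the labels may be family, genus, or
    species names, depending on the taxonomic level being evaluated.
    """
    hits = 0
    for q, q_label in zip(queries, query_labels):
        best = max(range(len(gallery)), key=lambda i: cosine(q, gallery[i]))
        hits += gallery_labels[best] == q_label
    return hits / len(queries)


# Toy 2-D embeddings, invented for this example (not real model outputs).
audio = [[1.0, 0.1], [0.0, 1.0]]
image = [[0.9, 0.2], [0.1, 0.8]]
labels = ["species_a", "species_b"]

directions = retrieval_directions()
audio_to_image = top1_accuracy(audio, image, labels, labels)
```

Three modalities give six ordered query-gallery pairs, which matches the three directions named in the abstract plus their reverses. Passing genus or family names instead of species names as labels would score the same embeddings at a coarser taxonomic level.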