BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment

Abstract

Understanding animal species from multimodal data is an emerging challenge at the intersection of computer vision and ecology. Recent biological models such as BioCLIP have demonstrated strong alignment between images and textual taxonomic information for species identification, but integrating the audio modality remains an open problem. We propose BioVITA, a visual-textual-acoustic alignment framework for biological applications. BioVITA comprises (i) a training dataset, (ii) a representation model, and (iii) a retrieval benchmark. First, we construct a large-scale training dataset of 1.3 million audio clips and 2.3 million images covering 14,133 species, annotated with 34 ecological trait labels. Second, building upon BioCLIP2, we introduce a two-stage training framework that aligns audio representations with visual and textual representations. Third, we develop a cross-modal retrieval benchmark that covers all six retrieval directions across the three modalities (image-to-audio, audio-to-text, text-to-image, and their reverses) at three taxonomic levels: Family, Genus, and Species. Extensive experiments demonstrate that our model learns a unified representation space that captures species-level semantics beyond taxonomy, advancing multimodal biodiversity understanding. The project page is available at: https://dahlian00.github.io/BioVITA_Page/
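
To make the benchmark structure concrete, the following is a minimal sketch of how retrieval directions and taxonomic levels might be enumerated and scored. It is not the authors' evaluation code: the top-1 accuracy metric, the cosine-similarity ranking, the helper names (`cosine`, `top1_accuracy`), and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
from itertools import permutations
from math import sqrt

MODALITIES = ("image", "audio", "text")
LEVELS = ("family", "genus", "species")

# All ordered (query, gallery) modality pairs: 3 * 2 = 6 directions.
DIRECTIONS = list(permutations(MODALITIES, 2))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def top1_accuracy(queries, gallery, level):
    """Fraction of queries whose nearest gallery item shares the label at `level`.

    Hypothetical item format: {"emb": [...], "family": ..., "genus": ..., "species": ...}.
    """
    hits = 0
    for q in queries:
        best = max(gallery, key=lambda g: cosine(q["emb"], g["emb"]))
        hits += best[level] == q[level]
    return hits / len(queries)


# Toy, hand-made embeddings (assumed values, not taken from the paper).
audio = [
    {"emb": [1.0, 0.1], "family": "Corvidae", "genus": "Corvus", "species": "Corvus corax"},
    {"emb": [0.1, 1.0], "family": "Paridae", "genus": "Parus", "species": "Parus major"},
]
images = [
    {"emb": [0.9, 0.2], "family": "Corvidae", "genus": "Corvus", "species": "Corvus corone"},
    {"emb": [0.2, 0.9], "family": "Paridae", "genus": "Parus", "species": "Parus major"},
]

# Audio-to-image retrieval, scored once per taxonomic level.
scores = {level: top1_accuracy(audio, images, level) for level in LEVELS}
print(DIRECTIONS)
print(scores)
```

In this toy case the raven recording retrieves a carrion crow image, which counts as correct at the Family and Genus levels but not at the Species level, showing why scores can differ across the three levels for the same embedding space.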
