Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity
June 25, 2024
Authors: Chih-Hsuan Yang, Benjamin Feuer, Zaki Jubery, Zi K. Deng, Andre Nakkab, Md Zahid Hasan, Shivani Chiranjeevi, Kelly Marshall, Nirmal Baishnab, Asheesh K Singh, Arti Singh, Soumik Sarkar, Nirav Merchant, Chinmay Hegde, Baskar Ganapathysubramanian
cs.AI
Abstract
We introduce Arboretum, the largest publicly accessible dataset designed to
advance AI for biodiversity applications. This dataset, curated from the
iNaturalist community science platform and vetted by domain experts to ensure
accuracy, includes 134.6 million images, surpassing existing datasets in scale
by an order of magnitude. The dataset encompasses image-language paired data
for a diverse set of species from birds (Aves), spiders/ticks/mites
(Arachnida), insects (Insecta), plants (Plantae), fungi/mushrooms (Fungi),
snails (Mollusca), and snakes/lizards (Reptilia), making it a valuable resource
for multimodal vision-language AI models for biodiversity assessment and
agriculture research. Each image is annotated with scientific names, taxonomic
details, and common names, enhancing the robustness of AI model training.
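
To make the idea of an image-language pair concrete, the following is a minimal sketch of how a text caption could be assembled from a record's common name, scientific name, and taxonomic ranks. The `TaxonRecord` fields and the caption template are assumptions made for this illustration; they are not the dataset's actual schema or captioning scheme.

```python
from dataclasses import dataclass


# Hypothetical record layout; the real Arboretum metadata schema may differ.
@dataclass
class TaxonRecord:
    common_name: str
    scientific_name: str
    kingdom: str
    phylum: str
    class_: str  # trailing underscore because "class" is a Python keyword
    order: str
    family: str
    genus: str


def build_caption(rec: TaxonRecord) -> str:
    """Join common name, scientific name, and taxonomy into one caption."""
    taxonomy = " ".join(
        [rec.kingdom, rec.phylum, rec.class_, rec.order, rec.family, rec.genus]
    )
    # The template below is an assumed example, not the authors' format.
    return f"a photo of {rec.common_name} ({rec.scientific_name}), taxonomy: {taxonomy}"


example = TaxonRecord(
    common_name="monarch butterfly",
    scientific_name="Danaus plexippus",
    kingdom="Animalia",
    phylum="Arthropoda",
    class_="Insecta",
    order="Lepidoptera",
    family="Nymphalidae",
    genus="Danaus",
)
caption = build_caption(example)
print(caption)
```

Pairing each image with a caption of this kind is one way the three annotation types could all reach a vision-language model through its text encoder.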
We showcase the value of Arboretum by releasing a suite of CLIP models
trained using a subset of 40 million captioned images. We introduce several new
benchmarks for rigorous assessment and report zero-shot accuracy, along with
evaluations across life stages, rare species, confounding species, and various
levels of the taxonomic hierarchy.
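
As a rough illustration of zero-shot evaluation at different taxonomic levels, the sketch below picks, for each image embedding, the class name whose text embedding has the highest cosine similarity, then scores the predictions at species and genus level. The 3-dimensional embeddings are made-up stand-ins for CLIP encoder outputs, and the helper names (`zero_shot_predict`, `accuracy_at_rank`) are hypothetical; this is not the authors' evaluation code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def zero_shot_predict(image_emb, text_embs):
    """Return the label whose text embedding is most similar to the image."""
    return max(text_embs, key=lambda label: cosine(image_emb, text_embs[label]))


def accuracy_at_rank(preds, truths, rank_of):
    """Fraction of predictions matching the truth after mapping both to a rank."""
    hits = sum(rank_of[p] == rank_of[t] for p, t in zip(preds, truths))
    return hits / len(truths)


# Toy text embeddings keyed by scientific name (values are made up).
text_embs = {
    "Danaus plexippus": [1.0, 0.1, 0.0],
    "Danaus gilippus": [0.9, 0.3, 0.0],
    "Apis mellifera": [0.0, 0.2, 1.0],
}
species_of = {name: name for name in text_embs}
genus_of = {name: name.split()[0] for name in text_embs}

# Toy image embeddings with their true species labels (values are made up).
images = [
    ([1.0, 0.0, 0.0], "Danaus plexippus"),
    ([0.8, 0.4, 0.0], "Danaus plexippus"),  # lands closer to the congener
    ([0.1, 0.1, 0.9], "Apis mellifera"),
]
preds = [zero_shot_predict(emb, text_embs) for emb, _ in images]
truths = [label for _, label in images]

species_acc = accuracy_at_rank(preds, truths, species_of)
genus_acc = accuracy_at_rank(preds, truths, genus_of)
print(f"species accuracy: {species_acc:.2f}, genus accuracy: {genus_acc:.2f}")
```

In this toy setup the second image is assigned to a sibling species in the same genus, so it counts as an error at species level but as correct at genus level. That is why reporting accuracy at several ranks can be more informative than a single number.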
We anticipate that Arboretum will spur the development of AI models that can
enable a variety of digital tools, ranging from pest control strategies and crop
monitoring to worldwide biodiversity assessment and environmental
conservation. These advancements are critical for ensuring food security,
preserving ecosystems, and mitigating the impacts of climate change. Arboretum
is publicly available, easily accessible, and ready for immediate use.
Please see the project website (https://baskargroup.github.io/Arboretum/) for
links to our data, models, and code.