VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images
August 28, 2024
作者: M. Maruf, Arka Daw, Kazi Sajeed Mehrab, Harish Babu Manogaran, Abhilash Neog, Medha Sawhney, Mridul Khurana, James P. Balhoff, Yasin Bakis, Bahadir Altintas, Matthew J. Thompson, Elizabeth G. Campolongo, Josef C. Uyeda, Hilmar Lapp, Henry L. Bart, Paula M. Mabee, Yu Su, Wei-Lun Chao, Charles Stewart, Tanya Berger-Wolf, Wasila Dahdul, Anuj Karpatne
cs.AI
Abstract
Images are increasingly becoming the currency for documenting biodiversity on
the planet, providing novel opportunities for accelerating scientific
discoveries in the field of organismal biology, especially with the advent of
large vision-language models (VLMs). We ask if pre-trained VLMs can aid
scientists in answering a range of biologically relevant questions without any
additional fine-tuning. In this paper, we evaluate the effectiveness of 12
state-of-the-art (SOTA) VLMs in the field of organismal biology using a novel
dataset, VLM4Bio, consisting of 469K question-answer pairs involving 30K images
from three groups of organisms: fishes, birds, and butterflies, covering five
biologically relevant tasks. We also explore the effects of applying prompting
techniques and tests for reasoning hallucination on the performance of VLMs,
shedding new light on the capabilities of current SOTA VLMs in answering
biologically relevant questions using images. The code and datasets for running
all the analyses reported in this paper can be found at
https://github.com/sammarfy/VLM4Bio.