MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research

March 17, 2025
Authors: James Burgess, Jeffrey J Nirschl, Laura Bravo-Sánchez, Alejandro Lozano, Sanket Rajan Gupte, Jesus G. Galaz-Montoya, Yuhui Zhang, Yuchang Su, Disha Bhowmik, Zachary Coman, Sarina M. Hasan, Alexandra Johannesson, William D. Leineweber, Malvika G Nair, Ridhi Yarlagadda, Connor Zuraski, Wah Chiu, Sarah Cohen, Jan N. Hansen, Manuel D Leonetti, Chad Liu, Emma Lundberg, Serena Yeung-Levy
cs.AI

Abstract

Scientific research demands sophisticated reasoning over multimodal data, a challenge especially prevalent in biology. Despite recent advances in multimodal large language models (MLLMs) for AI-assisted research, existing multimodal reasoning benchmarks only target up to college-level difficulty, while research-level benchmarks emphasize lower-level perception, falling short of the complex multimodal reasoning needed for scientific discovery. To bridge this gap, we introduce MicroVQA, a visual question answering (VQA) benchmark designed to assess three reasoning capabilities vital in research workflows: expert image understanding, hypothesis generation, and experiment proposal. MicroVQA consists of 1,042 multiple-choice questions (MCQs) curated by biology experts across diverse microscopy modalities, ensuring VQA samples represent real scientific practice. In constructing the benchmark, we find that standard MCQ generation methods induce language shortcuts, motivating a new two-stage pipeline: an optimized LLM prompt structures question-answer pairs into MCQs; then, an agent-based `RefineBot` updates them to remove shortcuts. Benchmarking state-of-the-art MLLMs reveals a peak performance of 53%; models with smaller LLMs only slightly underperform top models, suggesting that language-based reasoning is less challenging than multimodal reasoning; and tuning with scientific articles enhances performance. Expert analysis of chain-of-thought responses shows that perception errors are the most frequent, followed by knowledge errors and then overgeneralization errors. These insights highlight the challenges in multimodal scientific reasoning, showing that MicroVQA is a valuable resource for advancing AI-driven biomedical research. MicroVQA is available at https://huggingface.co/datasets/jmhb/microvqa, and the project page is at https://jmhb0.github.io/microvqa.
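
As a quick orientation to the released benchmark, the sketch below shows one way to load MicroVQA from the Hugging Face Hub with the `datasets` library and score multiple-choice predictions. The split name and field names (`image`, `question`, `choices`, `answer`) are illustrative assumptions rather than the documented schema; consult the dataset card at the URL above before relying on them.

```python
# Minimal usage sketch (not from the paper): load MicroVQA from the Hugging Face
# Hub and compute multiple-choice accuracy for a model. The split name and the
# field names below are assumptions for illustration only.
from datasets import load_dataset

dataset = load_dataset("jmhb/microvqa", split="train")  # split name is an assumption


def mcq_accuracy(predict_fn, samples):
    """Fraction of samples where predict_fn selects the correct answer choice."""
    correct = 0
    for sample in samples:
        # predict_fn is any callable mapping (image, question, choices) to a choice index.
        prediction = predict_fn(sample["image"], sample["question"], sample["choices"])
        correct += int(prediction == sample["answer"])
    return correct / len(samples)


# Example: a trivial baseline that always selects the first option.
print(f"Always-first-choice accuracy: {mcq_accuracy(lambda img, q, c: 0, dataset):.1%}")
```

A real evaluation would replace the lambda with a call to an MLLM that receives the microscopy image and question; the accuracy loop itself stays the same.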
