MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research

Abstract

Scientific research demands sophisticated reasoning over multimodal data, a challenge especially prevalent in biology. Despite recent advances in multimodal large language models (MLLMs) for AI-assisted research, existing multimodal reasoning benchmarks target at most college-level difficulty, while research-level benchmarks emphasize lower-level perception, falling short of the complex multimodal reasoning needed for scientific discovery. To bridge this gap, we introduce MicroVQA, a visual question answering (VQA) benchmark designed to assess three reasoning capabilities vital in research workflows: expert image understanding, hypothesis generation, and experiment proposal. MicroVQA consists of 1,042 multiple-choice questions (MCQs) curated by biology experts across diverse microscopy modalities, ensuring that VQA samples represent real scientific practice. In constructing the benchmark, we find that standard MCQ generation methods induce language shortcuts, motivating a new two-stage pipeline: an optimized LLM prompt structures question-answer pairs into MCQs; then, an agent-based "RefineBot" updates them to remove shortcuts. Benchmarking state-of-the-art MLLMs yields three findings: peak performance is 53%; models with smaller LLMs only slightly underperform top models, suggesting that language-based reasoning is less challenging than multimodal reasoning; and tuning with scientific articles enhances performance. Expert analysis of chain-of-thought responses shows that perception errors are the most frequent, followed by knowledge errors and then overgeneralization errors. These insights highlight the challenges in multimodal scientific reasoning and show that MicroVQA is a valuable resource for advancing AI-driven biomedical research. MicroVQA is available at https://huggingface.co/datasets/jmhb/microvqa, and the project page is at https://jmhb0.github.io/microvqa.
