

MicroVQA++: High-Quality Microscopy Reasoning Dataset with Weakly Supervised Graphs for Multimodal Large Language Model

November 14, 2025
Authors: Manyu Li, Ruian He, Chenxi Ma, Weimin Tan, Bo Yan
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) are increasingly applied to biomedical imaging, yet scientific reasoning for microscopy remains limited by the scarcity of large-scale, high-quality training data. We introduce MicroVQA++, a three-stage, large-scale, high-quality microscopy VQA corpus derived from the BIOMEDICA archive. Stage one bootstraps supervision from expert-validated figure-caption pairs sourced from peer-reviewed articles. Stage two applies HiCQA-Graph, a novel heterogeneous graph over images, captions, and QAs that fuses NLI-based textual entailment, CLIP-based vision-language alignment, and agent signals to identify and filter inconsistent samples. Stage three uses an MLLM agent to generate multiple-choice questions (MCQs), followed by human screening. The resulting release comprises a large training split and a human-checked test split whose distribution of hard samples across Bloom's taxonomy levels exceeds that of the MicroVQA benchmark. Our work delivers (i) a quality-controlled dataset that couples expert literature with graph-based filtering and human refinement; (ii) HiCQA-Graph, the first graph that jointly models (image, caption, QA) triples for cross-modal consistency filtering; (iii) evidence that careful data construction enables 4B-scale MLLMs to reach microscopy reasoning performance competitive with GPT-5, and to achieve state-of-the-art results among open-source MLLMs. Code and dataset will be released after the review process concludes.
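The abstract does not specify how the Stage-two consistency signals are computed or fused over the graph, so the sketch below is only a minimal illustration of the two node-level signals it names: NLI-based entailment between caption and QA text, and CLIP-based image-caption alignment. The model checkpoints (`roberta-large-mnli`, `openai/clip-vit-base-patch32`), the thresholds, and the AND-style fusion in `keep_sample` are illustrative assumptions, not the paper's method; the heterogeneous graph construction and agent signals are omitted entirely.

```python
"""Hypothetical sketch of per-sample cross-modal consistency signals,
in the spirit of HiCQA-Graph's Stage-two filtering. All model choices,
thresholds, and the fusion rule are assumptions for illustration."""
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, pipeline

# Assumed off-the-shelf checkpoints; the paper may use different models.
nli = pipeline("text-classification", model="roberta-large-mnli")
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def entailment_score(caption: str, qa_text: str) -> float:
    """P(caption entails the QA statement) from an MNLI cross-encoder."""
    scores = nli({"text": caption, "text_pair": qa_text}, top_k=None)
    return next(s["score"] for s in scores if s["label"] == "ENTAILMENT")


@torch.no_grad()
def clip_alignment(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and caption embeddings."""
    inputs = clip_processor(text=[caption], images=image,
                            return_tensors="pt", padding=True, truncation=True)
    img = clip_model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = clip_model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    return torch.nn.functional.cosine_similarity(img, txt).item()


def keep_sample(image: Image.Image, caption: str, qa_text: str,
                tau_nli: float = 0.5, tau_clip: float = 0.25) -> bool:
    """Hypothetical fusion rule: keep an (image, caption, QA) triple only
    if both the textual and vision-language signals clear a threshold."""
    return (entailment_score(caption, qa_text) >= tau_nli
            and clip_alignment(image, caption) >= tau_clip)
```

In the paper these signals are fused on a heterogeneous graph spanning all images, captions, and QAs rather than thresholded per sample, which is what lets filtering decisions propagate across related nodes; the per-triple rule above is only the simplest stand-in.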