

From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models

December 11, 2025
Authors: Zongzhao Li, Xiangzhe Kong, Jiahui Su, Zongyang Ma, Mingze Li, Songyou Li, Yuelin Zhang, Yu Rong, Tingyang Xu, Deli Zhao, Wenbing Huang
cs.AI

Abstract

This paper introduces the concept of Microscopic Spatial Intelligence (MiSI), the capability to perceive and reason about the spatial relationships of invisible microscopic entities, which is fundamental to scientific discovery. To assess the potential of Vision-Language Models (VLMs) in this domain, we propose MiSI-Bench, a systematic benchmark framework. It features over 163,000 question-answer pairs and 587,000 images derived from approximately 4,000 molecular structures, covering nine complementary tasks that evaluate abilities ranging from elementary spatial transformations to complex relational identification. Experimental results reveal that current state-of-the-art VLMs perform significantly below human level on this benchmark. However, a fine-tuned 7B model demonstrates substantial potential, even surpassing humans on spatial transformation tasks, while its poor performance on scientifically grounded tasks such as hydrogen-bond recognition underscores the necessity of integrating explicit domain knowledge for progress toward scientific AGI. The datasets are available at https://huggingface.co/datasets/zongzhao/MiSI-bench.
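
For readers who want to inspect the benchmark, the snippet below is a minimal sketch of loading the released dataset from the Hugging Face Hub with the `datasets` library. The repo id comes from the URL above; the split and field names printed at the end are assumptions, so consult the dataset card for the actual schema.

```python
# Minimal sketch (not from the paper) of pulling MiSI-Bench from the
# Hugging Face Hub with the `datasets` library.
from datasets import load_dataset

# Repo id taken from the URL in the abstract.
ds = load_dataset("zongzhao/MiSI-bench")

# Show the available splits, then peek at one record; field names such as
# "question" or "answer" are assumptions, so check the dataset card.
print(ds)
first_split = next(iter(ds))
print(ds[first_split][0])
```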