Seeing and Understanding: Bridging Vision with Chemical Knowledge Via ChemVLM
August 14, 2024
Authors: Junxian Li, Di Zhang, Xunzhi Wang, Zeying Hao, Jingdi Lei, Qian Tan, Cai Zhou, Wei Liu, Weiyun Wang, Zhe Chen, Wenhai Wang, Wei Li, Shufei Zhang, Mao Su, Wanli Ouyang, Yuqiang Li, Dongzhan Zhou
cs.AI
Abstract
In this technical report, we propose ChemVLM, the first open-source multimodal large language model dedicated to the field of chemistry, designed to address the incompatibility between chemical image understanding and text analysis. The model is built on the ViT-MLP-LLM architecture: we use ChemLLM-20B as the foundational large language model, endowing the model with robust capabilities in understanding and utilizing chemical text knowledge, and InternViT-6B as a powerful image encoder. We curated high-quality data from the chemical domain, including molecules, reaction formulas, and chemistry examination data, and compiled it into a bilingual multimodal question-answering dataset. We evaluate the model on multiple open-source benchmarks and three custom evaluation sets. Experimental results demonstrate that our model achieves excellent performance, securing state-of-the-art results on five of the six tasks evaluated. Our model can be found at https://huggingface.co/AI4Chem/ChemVLM-26B.
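The ViT-MLP-LLM design named in the abstract is the standard pattern for connecting a pretrained vision encoder to a pretrained language model: patch features from the ViT are mapped by a small MLP projector into the LLM's token-embedding space and processed as a prefix to the text tokens. Below is a minimal PyTorch sketch of this pattern; the class `VitMlpLlm`, its argument names, and the two-layer projector are illustrative assumptions, not the released ChemVLM implementation, and the sketch assumes a HuggingFace-style causal LM that exposes `get_input_embeddings()` and accepts `inputs_embeds`.

```python
import torch
import torch.nn as nn


class VitMlpLlm(nn.Module):
    """Illustrative ViT-MLP-LLM pipeline (names are hypothetical):
    a vision encoder produces patch embeddings, an MLP projector maps
    them into the LLM's embedding space, and the projected image
    tokens are prepended to the text embeddings fed to the LLM."""

    def __init__(self, vision_encoder, llm, vision_dim, llm_dim):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. an InternViT-style ViT
        self.llm = llm                        # e.g. a ChemLLM-style decoder LM
        # Lightweight MLP that aligns vision features with LLM embeddings.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, pixel_values, input_ids):
        # Patch features from the ViT: (batch, num_patches, vision_dim).
        image_feats = self.vision_encoder(pixel_values)
        # Project into the LLM's token-embedding space.
        image_tokens = self.projector(image_feats)
        # Embed the text tokens with the LLM's own embedding table.
        text_embeds = self.llm.get_input_embeddings()(input_ids)
        # Run the LLM on the concatenated image + text sequence.
        inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```

The MLP projector is the lightweight alignment module popularized by LLaVA-style models; it lets two large pretrained components be combined with very little trainable glue, which is why this pattern recurs across open multimodal LLMs.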