Seeing and Understanding: Bridging Vision with Chemical Knowledge Via ChemVLM
August 14, 2024
作者: Junxian Li, Di Zhang, Xunzhi Wang, Zeying Hao, Jingdi Lei, Qian Tan, Cai Zhou, Wei Liu, Weiyun Wang, Zhe Chen, Wenhai Wang, Wei Li, Shufei Zhang, Mao Su, Wanli Ouyang, Yuqiang Li, Dongzhan Zhou
cs.AI
Abstract
In this technical report, we propose ChemVLM, the first open-source multimodal large language model dedicated to the field of chemistry, designed to address the incompatibility between chemical image understanding and text analysis. Built upon the ViT-MLP-LLM architecture, we leverage ChemLLM-20B as the foundational large language model, endowing our model with robust capabilities in understanding and utilizing chemical text knowledge, and we employ InternViT-6B as a powerful image encoder. We curated high-quality data from the chemical domain, including molecules, reaction formulas, and chemistry examination data, and compiled these into a bilingual multimodal question-answering dataset. We evaluate our model on multiple open-source benchmarks and three custom evaluation sets. Experimental results demonstrate that our model achieves excellent performance, securing state-of-the-art results in five of the six tasks. Our model can be found at https://huggingface.co/AI4Chem/ChemVLM-26B.
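To make the ViT-MLP-LLM design concrete, below is a minimal sketch in PyTorch of how an MLP connector bridges a vision encoder and a language model: image-patch features are projected into the LLM's embedding space and concatenated with text-token embeddings. The class name, layer shapes, and dimensions are hypothetical placeholders for illustration, not ChemVLM's actual configuration.

```python
# Minimal sketch of a ViT-MLP-LLM pipeline; names and sizes are hypothetical.
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """MLP projector mapping vision-encoder features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim)
        return self.proj(vision_feats)  # -> (batch, num_patches, llm_dim)

# Toy usage with dummy tensors standing in for InternViT and ChemLLM outputs.
vision_dim, llm_dim = 3200, 6144                 # hypothetical dimensions
connector = VisionLanguageConnector(vision_dim, llm_dim)
image_feats = torch.randn(1, 256, vision_dim)    # stand-in for image-patch features
text_embeds = torch.randn(1, 32, llm_dim)        # stand-in for text-token embeddings
inputs = torch.cat([connector(image_feats), text_embeds], dim=1)
print(inputs.shape)                              # torch.Size([1, 288, 6144])
```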
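For trying the released checkpoint, a hedged sketch of loading it with Hugging Face transformers follows; this assumes the repository exposes its model through the standard `trust_remote_code` path, and the exact chat or generation interface it provides may differ.

```python
# Assumed loading pattern via Hugging Face transformers; interface may vary.
from transformers import AutoModel, AutoTokenizer

path = "AI4Chem/ChemVLM-26B"
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    path,
    torch_dtype="auto",       # pick the dtype stored in the checkpoint
    trust_remote_code=True,   # the repo ships custom modeling code
).eval()
```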