Seeing and Understanding: Bridging Vision with Chemical Knowledge Via ChemVLM
August 14, 2024
作者: Junxian Li, Di Zhang, Xunzhi Wang, Zeying Hao, Jingdi Lei, Qian Tan, Cai Zhou, Wei Liu, Weiyun Wang, Zhe Chen, Wenhai Wang, Wei Li, Shufei Zhang, Mao Su, Wanli Ouyang, Yuqiang Li, Dongzhan Zhou
cs.AI
Abstract
In this technical report, we propose ChemVLM, the first open-source multimodal large language model dedicated to the field of chemistry, designed to address the incompatibility between chemical image understanding and text analysis. Built upon the ViT-MLP-LLM architecture, we leverage ChemLLM-20B as the foundational large language model, endowing our model with robust capabilities in understanding and utilizing chemical text knowledge, and we employ InternViT-6B as a powerful image encoder. We curated high-quality data from the chemical domain, including molecules, reaction formulas, and chemistry examination data, and compiled these into a bilingual multimodal question-answering dataset. We evaluate our model on multiple open-source benchmarks and three custom evaluation sets. Experimental results demonstrate that our model achieves excellent performance, securing state-of-the-art results in five of the six tasks. Our model can be found at https://huggingface.co/AI4Chem/ChemVLM-26B.
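To make the ViT-MLP-LLM design concrete, below is a minimal sketch in PyTorch of how an MLP connector bridges a vision encoder and a language model: image-patch features are projected into the LLM's embedding space and concatenated with text-token embeddings. The class name, layer shapes, and dimensions are hypothetical placeholders for illustration, not ChemVLM's actual configuration.

```python
# Minimal sketch of a ViT-MLP-LLM pipeline; names and sizes are hypothetical.
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """MLP projector mapping vision-encoder features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim)
        return self.proj(vision_feats)  # -> (batch, num_patches, llm_dim)

# Toy usage with dummy tensors standing in for InternViT and ChemLLM outputs.
vision_dim, llm_dim = 3200, 6144                 # hypothetical dimensions
connector = VisionLanguageConnector(vision_dim, llm_dim)
image_feats = torch.randn(1, 256, vision_dim)    # stand-in for image-patch features
text_embeds = torch.randn(1, 32, llm_dim)        # stand-in for text-token embeddings
inputs = torch.cat([connector(image_feats), text_embeds], dim=1)
print(inputs.shape)                              # torch.Size([1, 288, 6144])
```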
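For trying the released checkpoint, a hedged sketch of loading it with Hugging Face transformers follows; this assumes the repository exposes its model through the standard `trust_remote_code` path, and the exact chat or generation interface it provides may differ.

```python
# Assumed loading pattern via Hugging Face transformers; interface may vary.
from transformers import AutoModel, AutoTokenizer

path = "AI4Chem/ChemVLM-26B"
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    path,
    torch_dtype="auto",       # pick the dtype stored in the checkpoint
    trust_remote_code=True,   # the repo ships custom modeling code
).eval()
```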